mercredi 16 janvier 2019

R - Sample consecutive series of dates in time series without replacement?

I have a data frame in R containing a series of dates. The earliest date is (ISO format) 2015-03-22 and the latest date is 2016-01-03, but there are two breaks within the data. Here is what it looks like:

library(tidyverse)
library(lubridate)

date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
                                  ymd("2015-07-03"),
                                  by = "days"),
                              seq(ymd("2015-08-09"),
                                  ymd("2015-10-01"),
                                  by = "days"),
                              seq(ymd("2015-11-12"),
                                  ymd("2016-01-03"),
                                  by = "days")),
                    sample_id = 0L)

I.e.:

> date_data
# A tibble: 211 x 2
   dates      sample_id
   <date>         <int>
 1 2015-03-22         0
 2 2015-03-23         0
 3 2015-03-24         0
 4 2015-03-25         0
 5 2015-03-26         0
 6 2015-03-27         0
 7 2015-03-28         0
 8 2015-03-29         0
 9 2015-03-30         0
10 2015-03-31         0
# … with 201 more rows

What I want to do is to take ten 10-day long samples of continous dates from within that time series without replacement. For example, a valid sample would be the ten days from 2015-04-01 to 2015-04-10 because that falls completely within the dates column in my date_data data frame. Each sample would then get a unique (non-zero) number in the sample_id column in date_data such as 1:10.

To be clear, my requirements are:

  1. Each sample would be 10 consecutive days.

  2. The sampling has to be without replacement. So if sample_id == 1 is the 2015-04-01 to 2015-04-10 period, those dates can't be part of another 10-day-long sample.

  3. Each 10-day-long sample can't include any date that's not within date_data$dates.

At the end, date_data$sample_id would have unique numbers representing each 10-day-long sample, likely with lots of 0s left over that were not part of any sample (and there would be 200 rows - 10 for each sample - where sample_id != 0).

I am aware of dplyr::sample_n() but it doesn't sample consecutive values, and I don't know how to devise a way to "remember" which dates have already been sampled...

What's a good way to do this? A for loop?!?! Or perhaps something with purrr? Thank you very much for your help.




Aucun commentaire:

Enregistrer un commentaire