I have a data frame in R containing a series of dates. The earliest date is (ISO format) 2015-03-22 and the latest date is 2016-01-03, but there are two breaks within the data. Here is what it looks like:
library(tidyverse)
library(lubridate)

date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
                                  ymd("2015-07-03"),
                                  by = "days"),
                              seq(ymd("2015-08-09"),
                                  ymd("2015-10-01"),
                                  by = "days"),
                              seq(ymd("2015-11-12"),
                                  ymd("2016-01-03"),
                                  by = "days")),
                    sample_id = 0L)
I.e.:
> date_data
# A tibble: 211 x 2
   dates      sample_id
   <date>         <int>
 1 2015-03-22         0
 2 2015-03-23         0
 3 2015-03-24         0
 4 2015-03-25         0
 5 2015-03-26         0
 6 2015-03-27         0
 7 2015-03-28         0
 8 2015-03-29         0
 9 2015-03-30         0
10 2015-03-31         0
# … with 201 more rows
What I want to do is take ten 10-day-long samples of consecutive dates from within that time series, without replacement. For example, a valid sample would be the ten days from 2015-04-01 to 2015-04-10, because that range falls completely within the dates column of my date_data data frame. Each sample would then get a unique (non-zero) number in the sample_id column of date_data, such as 1:10.
To be clear, my requirements are:
- Each sample would be 10 consecutive days.
- The sampling has to be without replacement. So if sample_id == 1 is the 2015-04-01 to 2015-04-10 period, those dates can't be part of another 10-day-long sample.
- Each 10-day-long sample can't include any date that's not within date_data$dates.
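One way to make the third requirement concrete is to label each unbroken run of dates and accept a 10-day window only if its first and last rows fall in the same run. The following is a minimal sketch in base R, not a definitive solution. The names run_id, window_len, is_valid_start and valid_starts are made up for illustration, and the date vector is rebuilt with as.Date() so that the block runs on its own.

```r
# Rebuild the date vector from the question using base R only.
dates <- c(seq(as.Date("2015-03-22"), as.Date("2015-07-03"), by = "day"),
           seq(as.Date("2015-08-09"), as.Date("2015-10-01"), by = "day"),
           seq(as.Date("2015-11-12"), as.Date("2016-01-03"), by = "day"))

# A new run starts wherever the gap to the previous date is not exactly 1 day.
run_id <- cumsum(c(TRUE, as.numeric(diff(dates)) != 1))

window_len <- 10L
n <- length(dates)

# Hypothetical helper: TRUE if a window starting at row i ends inside the
# data and its first and last rows belong to the same unbroken run.
is_valid_start <- function(i) {
  j <- i + window_len - 1L
  j <= n && run_id[i] == run_id[j]
}

valid_starts <- which(vapply(seq_len(n), is_valid_start, logical(1)))
```

Because the rows are sorted by date, two rows in the same run that are nine positions apart are exactly nine days apart, so checking only the endpoints is enough. With the data above, 2015-04-01 is a valid start, while 2015-07-01 is not, since only three days remain before the first break.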
At the end, date_data$sample_id would have unique numbers representing each 10-day-long sample, likely with lots of 0s left over that were not part of any sample (and there would be 100 rows, 10 for each of the ten samples, where sample_id != 0).
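To illustrate the end state described above, here is a rough sketch of one possible approach: a plain for loop that, for each sample, draws a random start row among the windows that stay inside one unbroken run and do not touch an already labelled row. It is only one way to do it, written in base R under some assumptions: date_data is rebuilt as an ordinary data.frame so the block is self-contained, and n_samples, window_len, run_id, starts, ends, free and ok are illustrative names, not anything from dplyr or purrr.

```r
set.seed(42)  # only to make the sketch reproducible

# Same dates as in the question, rebuilt with base R (an assumption made
# so that this block runs without the tidyverse).
date_data <- data.frame(
  dates = c(seq(as.Date("2015-03-22"), as.Date("2015-07-03"), by = "day"),
            seq(as.Date("2015-08-09"), as.Date("2015-10-01"), by = "day"),
            seq(as.Date("2015-11-12"), as.Date("2016-01-03"), by = "day")),
  sample_id = 0L
)

n_samples  <- 10L
window_len <- 10L
n <- nrow(date_data)

# Rows share a run while consecutive dates differ by exactly 1 day.
run_id <- cumsum(c(TRUE, as.numeric(diff(date_data$dates)) != 1))

# Candidate windows: every start row whose window still fits in the data.
starts <- seq_len(n - window_len + 1L)
ends   <- starts + window_len - 1L

for (id in seq_len(n_samples)) {
  # The sample_id column itself "remembers" which rows are already taken.
  free <- vapply(starts, function(i) {
    all(date_data$sample_id[i:(i + window_len - 1L)] == 0L)
  }, logical(1))

  # Usable starts: window stays in one run and overlaps no earlier sample.
  ok <- starts[run_id[starts] == run_id[ends] & free]
  if (length(ok) == 0L) stop("no room left for another window")

  # Index into ok explicitly: sample(ok, 1) would draw from 1:ok
  # if ok ever had length 1.
  s <- ok[sample.int(length(ok), 1L)]
  date_data$sample_id[s:(s + window_len - 1L)] <- id
}
```

Afterwards, table(date_data$sample_id) should show ten rows for each of the ids 1 to 10 and the remaining rows at 0. The stop() guard is there because random placement can, in general, fragment the free space; with ten windows in this particular data set there is always room, but that would not hold for much larger n_samples.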
I am aware of dplyr::sample_n(), but it doesn't sample consecutive values, and I don't know how to devise a way to "remember" which dates have already been sampled.
What's a good way to do this? A for loop? Or perhaps something with purrr? Thank you very much for your help.