I have a dataset comprising individuals with a certain illness:
library(lubridate)
id <- c(1:40)
birthDt <- ymd(c("1971-04-21", "1970-08-16", "1981-07-14", "1977-10-26", "1965-08-12", "1972-06-04", "1961-06-13", "1943-06-03", "1960-05-10",
"1973-01-28", "1980-10-15", "1964-06-02", "1974-09-23", "1972-10-23", "1981-10-05", "1959-08-21", "1940-09-26", "1975-02-08",
"1983-10-26", "1948-05-07", "1981-08-13", "1981-08-15", "1978-12-16", "1975-11-27", "1977-09-07", "1975-07-20", "1977-07-24",
"1976-06-06", "1981-09-27", "1973-08-17", "1978-11-05", "1978-10-19", "1965-02-22", "1963-09-02", "1970-11-05", "1971-06-14",
"1972-03-12", "1968-01-07", "1953-07-02", "1947-09-29"))
dxDt <- ymd(c("2000-08-30", "2000-05-01", "2000-01-14", "2000-01-17", "2000-01-25", "2000-01-19", "2000-01-11", "2000-01-13", "2000-01-27",
"2000-01-25", "2000-01-11", "2000-01-15", "2000-01-21", "2000-02-23",
"2000-01-26", "2000-01-30", "2000-01-24", "2000-02-07", "2000-01-04", "2000-02-09", "2000-08-01",
"2000-08-14", "2000-08-28", "2000-09-01", "2000-09-01", "2000-09-04", "2000-09-04", "2000-09-04",
"2000-09-12", "2000-09-26", "2000-10-02", "2000-10-04", "2000-05-31", "2000-07-20", "2000-08-04",
"2000-07-18", "2000-08-19", "2000-08-24", "2000-08-22", "2000-10-05"))
dxAge <- time_length(interval(birthDt, dxDt), "year")
# Untreated individuals have no intervention date; use real NA values rather than the string "NA"
intDt <- ymd(c("1999-08-12", "1999-10-15", "1999-12-15", "1999-12-29", "2000-01-17", "2000-01-19", "2000-02-02", "2000-02-02", "2000-02-07",
"2000-02-08", "2000-02-08", "2000-02-16", "2000-02-18", "2000-02-21", "2000-02-22", "2000-02-22", "2000-02-23", "2000-02-23",
"2000-02-25", "2000-02-25", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA))
intAge <- time_length(interval(birthDt, intDt), "year")
enDt <- ymd(c("2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19",
"2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2006-08-12",
"2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19",
"2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19",
"2009-06-19", "2009-06-19", "2009-06-19", "2009-06-19"))
enAge <- time_length(interval(birthDt, enDt), "year")
df <- data.frame(id, birthDt, dxDt, dxAge, intDt, intAge, enDt, enAge)
head(df)
id birthDt dxDt dxAge intDt intAge enDt enAge
1 1 1971-04-21 2000-08-30 29.35890 1999-08-12 28.30874 2009-06-19 38.16164
2 2 1970-08-16 2000-05-01 29.70765 1999-10-15 29.16393 2009-06-19 38.84110
3 3 1981-07-14 2000-01-14 18.50273 1999-12-15 18.42077 2009-06-19 27.93151
4 4 1977-10-26 2000-01-17 22.22678 1999-12-29 22.17486 2009-06-19 31.64658
5 5 1965-08-12 2000-01-25 34.45355 2000-01-17 34.43169 2009-06-19 43.85205
6 6 1972-06-04 2000-01-19 27.62568 2000-01-19 27.62568 2009-06-19 37.04110
tail(df)
id birthDt dxDt dxAge intDt intAge enDt enAge
35 35 1970-11-05 2000-08-04 29.74590 <NA> NA 2009-06-19 38.61918
36 36 1971-06-14 2000-07-18 29.09315 <NA> NA 2009-06-19 38.01370
37 37 1972-03-12 2000-08-19 28.43836 <NA> NA 2009-06-19 37.27123
38 38 1968-01-07 2000-08-24 32.62842 <NA> NA 2009-06-19 41.44658
39 39 1953-07-02 2000-08-22 47.13973 <NA> NA 2009-06-19 55.96438
40 40 1947-09-29 2000-10-05 53.01644 <NA> NA 2009-06-19 61.72055
Where:
dxDt is the date each individual was diagnosed
intDt is the date some individuals received a particular treatment
enDt is the end date for each individual
I would like to analyse the risk of negative outcomes pre- and post-intervention for the group of people who received treatment, and compare those risks with the group that did not receive treatment.
There is a strong relationship between the illness / outcomes and age. Therefore, I would like to match the groups on intAge. Since this value does not exist for those who did not receive treatment, I am looking to generate random dates (intDt) or ages (intAge) to enable a match.
The values should not exceed the individuals' enDt (i.e., they are censored) and should be normally distributed:
a <- mean(df$intAge, na.rm = TRUE)
a
[1] 32.07213
b <- sd(df$intAge, na.rm = TRUE)
b
[1] 12.5652
I tried using rtnorm from the msm package:
library(msm)
set.seed(1)
x <- rtnorm(id, mean = a, sd = b, lower = dxAge, upper = enAge)
but it seemed to be quite slow (even with this small sample) and had not finished its run after 10 minutes. Is this typical?
The function below appears to generate appropriate values for each person; however, I would like to generate values only for those whose intAge is missing (NA).
library(truncnorm)
set.seed(1)
x <- rtruncnorm(id, a = dxAge, b = enAge, mean = a, sd = b)
Applying:
set.seed(1)
y <- rtruncnorm(df$id[df$intAge=="NA"], a = df$dxAge, b = df$enAge, mean = a, sd = b)
also provides a value for each person.
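For reference, here is a minimal sketch of the subsetting I think is needed. In R a missing value is tested with is.na(), since comparing with the string "NA" using == never returns TRUE for a missing value, and the bounds have to be subset with the same index so that they line up with the selected rows. The toy data frame reuses rows 1-3 and 38-40 of the printed output above. The helper r_trunc_norm and the column simAge are hypothetical names, and the helper uses a base-R inverse-CDF draw as a stand-in for rtruncnorm so that the example runs without extra packages.

```r
# Toy subset of the data shown above (rows 1-3 and 38-40), ages in years.
df <- data.frame(
  id     = c(1, 2, 3, 38, 39, 40),
  dxAge  = c(29.35890, 29.70765, 18.50273, 32.62842, 47.13973, 53.01644),
  intAge = c(28.30874, 29.16393, 18.42077, NA, NA, NA),
  enAge  = c(38.16164, 38.84110, 27.93151, 41.44658, 55.96438, 61.72055)
)

# Mean and SD of the observed intervention ages, as reported above.
a <- 32.07213
b <- 12.5652

# Logical index of the untreated rows: is.na() gives TRUE/FALSE, never NA.
untreated <- is.na(df$intAge)

# Hypothetical helper (not from any package): inverse-CDF draw from a
# normal distribution truncated to [lower, upper], one draw per bound pair.
r_trunc_norm <- function(lower, upper, mean, sd) {
  p_lo <- pnorm(lower, mean, sd)
  p_hi <- pnorm(upper, mean, sd)
  qnorm(runif(length(lower), p_lo, p_hi), mean, sd)
}

set.seed(1)
df$simAge <- df$intAge  # keep the observed ages for treated individuals
df$simAge[untreated] <- r_trunc_norm(
  lower = df$dxAge[untreated],  # bounds subset with the same index
  upper = df$enAge[untreated],
  mean = a, sd = b
)
```

With the truncnorm package the same pattern would presumably be rtruncnorm(sum(untreated), a = df$dxAge[untreated], b = df$enAge[untreated], mean = a, sd = b).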
How might I think about this correctly? Further, since some individuals (30% in this example) received treatment before they were diagnosed, is it possible to account for that variation in the generation of random values?
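One possibility I can imagine for the second point, sketched here only as an assumption and not as an established method, is to resample the observed gaps between intervention age and diagnosis age from the treated group and add them to the diagnosis age of the untreated group, capping the result at enAge. This swaps the truncated normal for empirical resampling, so it gives up the normality requirement. The names gap, sim_gap and simAge are hypothetical, and the values are the six treated rows from head(df) and three untreated rows from tail(df) above.

```r
# Treated individuals (rows 1-6 of the printed output), ages in years.
dxAge_treated  <- c(29.35890, 29.70765, 18.50273, 22.22678, 34.45355, 27.62568)
intAge_treated <- c(28.30874, 29.16393, 18.42077, 22.17486, 34.43169, 27.62568)

# Gap between intervention and diagnosis; negative means treated before diagnosis.
gap <- intAge_treated - dxAge_treated

# Untreated individuals (rows 38-40 of the printed output).
dxAge_untreated <- c(32.62842, 47.13973, 53.01644)
enAge_untreated <- c(41.44658, 55.96438, 61.72055)

set.seed(1)
# Draw one observed gap per untreated individual, with replacement.
sim_gap <- sample(gap, length(dxAge_untreated), replace = TRUE)

# Simulated intervention age, never later than the individual's end age.
simAge <- pmin(dxAge_untreated + sim_gap, enAge_untreated)
```

Because the gaps are drawn from the treated group, the simulated ages carry over whatever share of before-diagnosis treatment that group shows; this toy subset is dominated by such cases, so the full data would give a different mix.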
Any help would be greatly appreciated.