I am working with R and have the following dataset, which consists of sentences taken from books. Each row contains the book ID, the book's cover colour (Colour), and a sentence ID that is matched to the corresponding book.
My dataset
Book ID | Sentence ID | Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here
I would like to take four randomized subsamples (each containing 25% of the original data) under the following conditions:
1) The distribution of book colours should remain the same as in the original dataset. If 10% of the books in the original data are blue, each subsample should also contain about 10% blue books.
2) The subsamples should not be split by row (that is, by sentence ID) but by Book ID. This means that if Book ID 4 is sampled, then all of its sentences (7, 8, 9, 10, and 11) should be in the same subsample.
3) Each Book ID should appear in only one of the four subsamples. This means that if I merged all four subsamples, I would end up with the original dataset again.
What would be the best solution to split my dataset in the way described above?
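One possible approach, sketched here in base R, is to work at book level rather than sentence level: reduce the data to one row per book, shuffle the books within each colour, deal them out across four groups, and then map the group label back onto the sentences. The data frame `books`, its column names (`book_id`, `sentence_id`, `colour`, `sentence`), and the helper `assign_folds` are assumptions made for illustration; they are not part of the original data.

```r
# Toy data mirroring the table above (column names are assumed).
books <- data.frame(
  book_id     = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6),
  sentence_id = 1:15,
  colour      = c(rep("Blue", 3), rep("Red", 2), "Green", rep("Orange", 5),
                  rep("Blue", 2), rep("Red", 2)),
  sentence    = "Text goes here",
  stringsAsFactors = FALSE
)

set.seed(42)   # only so the random split is reproducible
n_folds <- 4

# One row per book, so that sampling happens per book, not per sentence.
book_level <- unique(books[, c("book_id", "colour")])

# Hypothetical helper: produce n fold labels cycling through 1..k,
# starting at a random fold, then shuffle which book gets which label.
assign_folds <- function(n, k) {
  start <- sample.int(k, 1)
  folds <- ((seq_len(n) + start - 2) %% k) + 1
  folds[sample.int(n)]
}

# Apply the helper separately within each colour (stratification).
book_level$fold <- ave(
  seq_len(nrow(book_level)), book_level$colour,
  FUN = function(i) assign_folds(length(i), n_folds)
)

# Map each sentence to the fold of its book, then split into four parts.
books$fold <- book_level$fold[match(books$book_id, book_level$book_id)]
subsamples <- split(books, factor(books$fold, levels = seq_len(n_folds)))
```

Because every sentence inherits the fold of its book, conditions 2 and 3 hold by construction, and `do.call(rbind, subsamples)` gives back all the original rows. Condition 1 and the 25% target are only met approximately: colours with fewer than four books cannot be spread over all four subsamples, and the subsamples are balanced by number of books, not by number of sentences. With a dataset as small as the toy table, some subsamples may even be empty.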