mardi 27 juin 2017

Split data based on grouping column

I'm trying to work out how, in Azure ML (and therefore R solutions are acceptable), to randomly split data based on a column, such that all records with any given value in that column wind up in one side of the split or another. For example:

+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
|       1234 |    1 | Foo                |    1 |
|       5678 |    0 | Bar                |    1 |
|    9101112 |    1 | Quack              |    1 |
|   13141516 |    1 | Meep               |    1 |
|       1234 |    0 | Boop               |    2 |
|       5678 |    0 | Baa                |    2 |
|    9101112 |    0 | Bleat              |    2 |
|   13141516 |    1 | Maaaa              |    2 |
|       1234 |    0 | Foo                |    3 |
|       5678 |    0 | Bar                |    3 |
|    9101112 |    1 | Quack              |    3 |
|   13141516 |    1 | Meep               |    3 |
|       1234 |    1 | Boop               |    4 |
|       5678 |    1 | Baa                |    4 |
|    9101112 |    0 | Bleat              |    4 |
|   13141516 |    1 | Maaaa              |    4 |
+------------+------+--------------------+------+

Acceptable output from that if I chose, say, a 50/50 split and to be grouped based on the Student ID column would be two new datasets:

+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
|       1234 |    1 | Foo                |    1 |
|       1234 |    0 | Boop               |    2 |
|       1234 |    0 | Foo                |    3 |
|       1234 |    1 | Boop               |    4 |
|    9101112 |    1 | Quack              |    1 |
|    9101112 |    0 | Bleat              |    2 |
|    9101112 |    1 | Quack              |    3 |
|    9101112 |    0 | Bleat              |    4 |
+------------+------+--------------------+------+

and

+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
|       5678 |    0 | Bar                |    1 |
|       5678 |    0 | Baa                |    2 |
|       5678 |    0 | Bar                |    3 |
|       5678 |    1 | Baa                |    4 |
|   13141516 |    1 | Meep               |    1 |
|   13141516 |    1 | Maaaa              |    2 |
|   13141516 |    1 | Meep               |    3 |
|   13141516 |    1 | Maaaa              |    4 |
+------------+------+--------------------+------+

Now, from what I can tell this is basically the opposite of stratified split, where it would get a random sample with every student represented on both sides.

I would prefer an Azure ML function that did this, but I think that's unlikely so is there an R function or library that gives this kind of functionality? All I could find was questions about stratification which obviously don't help me much.




Aucun commentaire:

Enregistrer un commentaire