Wednesday, 22 March 2017

Select clustered rows from a table

I'll lay out my problem as a list to make it easier to follow:

  • I have a MATLAB table T of size 1000x30.
  • The last column of the table, called 'Class', holds integer values ranging from 1 to 20.
  • So some rows have the value 1, which means they belong to Class 1, some have the value 2, some have the value 20, and so on.
  • The classes do not all have the same number of rows, so there may be 100 rows of class 1 but only 10 rows of class 2, 500 rows of class 3, and so on.

This is what I want to do:

  • I want to find the class with the fewest rows and how many rows it has. Say Class 10 has the fewest, with count == 3, while every other class has more than 3 rows.
  • I will then add a new column called YesNo that contains only the values 0 or 1.
  • All rows of the class with the lowest count (e.g. Class 10 in this example) will get the value 1.
  • From every other class, I want to randomly select as many rows as the smallest class has (3 in this example).
  • These randomly selected rows get the value 1 in the new column YesNo, and the rows that were not chosen get 0.
  • So in this example, the new column ends up with 1000 values, of which 3*20 = 60 are 1 (3 is the number of rows in the smallest class and 20 is the number of classes) and the rest are 0.
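
The counting step in the list above could be sketched roughly as follows. This is only a sketch under assumptions: the Class vector here is randomly generated as a hypothetical stand-in for T.Class, and the variable names (counts, k, smallestClass, expectedOnes) are made up for illustration.

```matlab
% Hypothetical stand-in for T.Class: 1000 labels drawn from 1..20.
rng(0);                                % fixed seed so the run is repeatable
Class = randi(20, 1000, 1);

[classes, ~, idx] = unique(Class);     % idx maps each row to its class position
counts = accumarray(idx, 1);           % number of rows per class
[k, pos] = min(counts);                % k = size of the smallest class
smallestClass = classes(pos);          % label of that class
expectedOnes = k * numel(classes);     % rows that should end up with YesNo == 1
```

Going through unique first, rather than calling accumarray on the labels directly, keeps the count correct even if some label between 1 and 20 never occurs in the data.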

How can this be done in MATLAB R2015b? I know that I can create a new column in the table using T.YesNo = newArr; where newArr is a 1000x1 double containing 0 and 1 values.

As a small example, if T is 10x3 and has only 3 classes (1, 2, 3), T looks like this:

ID  Name    Class   
0   'a'     3
1   'b'     2
2   'a'     2
3   'b'     2
4   'a'     3
5   'a'     1
6   'a'     1
7   'b'     2
8   'b'     1
9   'a'     2

As shown above, class 3 has the lowest count, with only 2 rows. So I want to randomly select two rows from each of class 1 and class 2, set the new column to 1 for these selected rows and for both rows of class 3, and set it to 0 for the rest, as shown below:

ID  Name    Class   YesNo
0   'a'     3       1
1   'b'     2       0
2   'a'     2       1
3   'b'     2       0
4   'a'     3       1
5   'a'     1       0
6   'a'     1       1
7   'b'     2       0
8   'b'     1       1
9   'a'     2       1
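
One possible way to get a column like this, sketched on the small example table, is shown below. It is not the only approach and rests on some assumptions: the loop variable names (rows, pick) are made up for illustration, and because the selection is random, the rows marked 1 in classes 1 and 2 will generally differ from the ones shown above. Only functions that should be available in R2015b are used (table, unique, accumarray, randperm).

```matlab
% Small example table from the question (ID, Name, Class).
ID    = (0:9)';
Name  = {'a';'b';'a';'b';'a';'a';'a';'b';'b';'a'};
Class = [3;2;2;2;3;1;1;2;1;2];
T = table(ID, Name, Class);

% Count rows per class; idx maps each row to its position in classes.
[classes, ~, idx] = unique(T.Class);
counts = accumarray(idx, 1);
k = min(counts);                            % size of the smallest class

% Mark k randomly chosen rows of every class.
YesNo = zeros(height(T), 1);
for c = 1:numel(classes)
    rows = find(idx == c);                  % row numbers of this class
    pick = rows(randperm(numel(rows), k));  % k of them, without replacement
    YesNo(pick) = 1;
end
T.YesNo = YesNo;
```

For the smallest class, randperm(k, k) returns all of its rows in shuffled order, so that class needs no special case and every one of its rows gets the value 1.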



