Friday, June 23, 2017

Splitting an array into train and test sets with Python

I tried a method to split data into train and test sets, but it seems to fill train with zeros and leave all the data in test...

In theory, it works:

When I apply the following function, which randomly picks some columns in each row of the given array, it works on the DataLens data stored as a NumPy matrix, but not on other arrays.

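import numpy as np

def train_test_split(array):
    # Hold out 10 randomly chosen nonzero entries per row as the test set.
    # Assumes every row has at least 10 nonzero entries.
    test = np.zeros(array.shape)
    train = array.copy()
    for user in range(array.shape[0]):  # range replaces Python 2's xrange
        test_ratings = np.random.choice(array[user, :].nonzero()[0],
                                        size=10,
                                        replace=False)
        train[user, test_ratings] = 0.
        # Read from the array argument, not from the global ratings
        test[user, test_ratings] = array[user, test_ratings]

    # Test and training are truly disjoint
    assert np.all((train * test) == 0)
    return train, test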

train, test = train_test_split(ratings)
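
As a sanity check of the "in theory" case, here is a minimal sketch on synthetic data. The matrix `dense` is made up for illustration (it is not the dataset from this post), and the `size` parameter is an addition of mine so the number of held-out entries is explicit. Every row has 40 nonzero entries, so drawing 10 per row without replacement is always possible.

```python
import numpy as np

def train_test_split(array, size=10):
    """Move `size` randomly chosen nonzero entries of each row into test."""
    test = np.zeros(array.shape)
    train = array.copy()
    for user in range(array.shape[0]):
        test_ratings = np.random.choice(array[user, :].nonzero()[0],
                                        size=size,
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = array[user, test_ratings]
    # Test and training are truly disjoint
    assert np.all((train * test) == 0)
    return train, test

np.random.seed(0)
# Synthetic 6 x 40 matrix with ratings from 1 to 5, so no entry is zero
dense = np.random.randint(1, 6, size=(6, 40)).astype(float)

train, test = train_test_split(dense)
print(np.count_nonzero(test, axis=1))   # 10 held-out entries in every row
print(np.count_nonzero(train, axis=1))  # 30 entries left in every row
```

In this setting train and test add back up to the original matrix, which is the behaviour I expected everywhere.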

With simple data, it doesn't work:

Take this small array:

ratings:
[[ 1.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 1.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  1.]]
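
Since the function samples its test entries from the nonzero positions of each row, a quick check is to count them. The sketch below does that for the array above, and also tries the same draw the function makes (10 entries, no replacement) on a row with a single rating. The variable names `per_row` and `raised` are mine.

```python
import numpy as np

ratings = np.array([[1., 1., 0., 0., 0.],
                    [1., 0., 0., 0., 0.],
                    [0., 0., 1., 0., 0.],
                    [1., 0., 0., 0., 0.],
                    [0., 0., 0., 1., 1.]])

# Number of nonzero entries (ratings) in each row
per_row = np.count_nonzero(ratings, axis=1)
print(per_row)  # [2 1 1 1 2]

# Drawing 10 distinct entries from a row that only has one cannot succeed
raised = False
try:
    np.random.choice(ratings[1, :].nonzero()[0], size=10, replace=False)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

No row has more than two ratings, so there is very little to split here in the first place.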

It fills train with zeros, one entry at a time, even though train started out as a copy of ratings:

train:
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]
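
A plausible reading of this output is that every nonzero entry of each row ended up held out, which leaves nothing in train when rows only carry one or two ratings. One possible workaround, sketched below under that assumption, is to hold out a fraction of each row's ratings and always keep at least one in train. The function name `split_keep_one` and the parameters `test_fraction` and `seed` are hypothetical, not part of the original function.

```python
import numpy as np

def split_keep_one(array, test_fraction=0.5, seed=0):
    """Hold out a fraction of each row's nonzero entries, keeping >= 1 in train."""
    rng = np.random.default_rng(seed)
    train = array.copy()
    test = np.zeros(array.shape)
    for user in range(array.shape[0]):
        rated = array[user, :].nonzero()[0]
        # Cap the held-out count so one rating always stays in train
        n_test = min(int(test_fraction * len(rated)), len(rated) - 1)
        if n_test <= 0:
            continue  # zero or one rating: leave the row untouched
        held_out = rng.choice(rated, size=n_test, replace=False)
        train[user, held_out] = 0.
        test[user, held_out] = array[user, held_out]
    # Test and training are truly disjoint
    assert np.all((train * test) == 0)
    return train, test

ratings = np.array([[1., 1., 0., 0., 0.],
                    [1., 0., 0., 0., 0.],
                    [0., 0., 1., 0., 0.],
                    [1., 0., 0., 0., 0.],
                    [0., 0., 0., 1., 1.]])

train, test = split_keep_one(ratings)
print("train:\n", train)
print("test:\n", test)
```

On this array only the two rows with two ratings contribute a test entry; rows with a single rating stay entirely in train, which is a design choice rather than the only option.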



