I am trying to create an artificial dataset for a generic collaborative filtering model.
For simplicity's sake, assume it is a model of users, movies, and the rating each user gives a movie.
For generating the data, I did the following:
1. Give each user its own random value from 5 to 20.
2. All movies such that {movie_index % user_value == 0} are 'liked' by that user, those with {movie_index % user_value == ±1} are 'disliked', and the rest are 'neutral'.
> For a user with value 12, movies with indices 12 and 24 would be liked, and 13 would be disliked.
3. For actually assigning the ratings to movies, I used the following code:
if movie.LikedBy(user):
    rating = np.random.normal(4, 0.5)    # liked: mean 4, sd 0.5
elif movie.DislikedBy(user):
    rating = np.random.normal(1, 0.5)    # disliked: mean 1, sd 0.5
else:
    rating = np.random.normal(2.5, 1)    # neutral: mean 2.5, sd 1
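In case it helps, here is roughly what the full generation loop looks like as a self-contained sketch (the user/movie counts and variable names here are placeholders, not my real code):

import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies = 100, 500                      # placeholder sizes, for illustration only

user_values = rng.integers(5, 21, size=n_users)   # step 1: a random value from 5 to 20 per user

ratings = []
for u, value in enumerate(user_values):
    for m in range(1, n_movies + 1):
        r = m % value
        if r == 0:                                # step 2: liked
            rating = rng.normal(4, 0.5)
        elif r in (1, value - 1):                 # disliked ({movie_index % user_value == ±1})
            rating = rng.normal(1, 0.5)
        else:                                     # neutral
            rating = rng.normal(2.5, 1)
        ratings.append((u, m, rating))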
The problem I'm facing is that the model is too eager to learn 2.5 as the suggested rating, and even in the data it's not uncommon to see 'liked' movies given a rating of 1, which I suspect is generating far too many outliers in the dataset.
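For a rough sense of the imbalance (a back-of-the-envelope check of my own, not part of the generator): for a user with value v, only about 3 out of every v movie indices are non-neutral, so the neutral class dominates:

# fraction of movies that end up neutral for a few user values
for v in (5, 12, 20):
    print(f"user value {v}: ~{1 - 3 / v:.0%} neutral")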
Is there a better way to generate this data, or is it better to just use an existing movie-ratings dataset?