Deep Learning With Keras: Recommender Systems

29th April 2019

In this post we'll continue the series on deep learning by using the popular Keras framework to build a recommender system. This use case is much less common in the deep learning literature than things like image classifiers or text generators, but is arguably an even more common problem in practice. In fact, as you'll see below, it's debatable whether this topic even qualifies as "deep learning", because we're going to see how to build a pretty good recommender system without using a neural network at all! We will, however, take advantage of the power of a modern computation framework like Keras to implement the recommender with minimal code. We'll try a couple of different approaches using a technique called collaborative filtering. Finally, we'll build a true neural network and see how it compares to the collaborative filtering approach.

The data used for this task is the MovieLens data set. As with the previous posts, much of this content is originally based on Jeremy Howard's excellent fast.ai lessons.

I've already saved the zip file to a local directory, so we can get started with some imports and by reading in the ratings.csv file, which contains the data for this task.
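
If you're following along and don't have the data yet, a snippet like the one below should fetch and extract it first. The GroupLens URL is my assumption here, not part of the original workflow, so double-check it against the MovieLens site.

# Optional: download and extract the small MovieLens data set
# (URL assumed; adjust the target directory to match PATH below)
import io, zipfile, requests

url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
with zipfile.ZipFile(io.BytesIO(requests.get(url).content)) as archive:
    archive.extractall('/home/paperspace/data/')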

%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

PATH = '/home/paperspace/data/ml-latest-small/'

ratings = pd.read_csv(PATH + 'ratings.csv')
ratings.head()
userId	movieId	rating	timestamp
1	31	2.5	1260759144
1	1029	3.0	1260759179
1	1061	3.0	1260759182
1	1129	2.0	1260759185
1	1172	4.0	1260759205

The data is tabular and consists of a user ID, a movie ID, and a rating (there's also a timestamp but we won't use it for this task). Our task is to predict the rating for a user/movie pair, with the idea that if we had a model that's good at this task then we could predict how a user would rate movies they haven't seen yet and recommend movies with the highest predicted rating.

The zip file also includes a listing of movies and their associated genres. We don't actually need this for the model but it's useful to know about.

movies = pd.read_csv(PATH + 'movies.csv')
movies.head()

movieId	title	genres
1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
2	Jumanji (1995)	Adventure|Children|Fantasy
3	Grumpier Old Men (1995)	Comedy|Romance
4	Waiting to Exhale (1995)	Comedy|Drama|Romance
5	Father of the Bride Part II (1995)	Comedy

To get a better sense of what the data looks like, we can build a cross-tab by taking the 15 most active users and the 15 most-rated movies and joining them back against the ratings. The result shows how each of these top users rated each of the top movies.

g = ratings.groupby('userId')['rating'].count()
top_users = g.sort_values(ascending=False)[:15]

g = ratings.groupby('movieId')['rating'].count()
top_movies = g.sort_values(ascending=False)[:15]

top_r = ratings.join(top_users, rsuffix='_r', how='inner', on='userId')
top_r = top_r.join(top_movies, rsuffix='_r', how='inner', on='movieId')

pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum)
movieId	1	110	260	296	318	356	480	527	589	593	608	1196	1198	1270	2571
userId															
15	2.0	3.0	5.0	5.0	2.0	1.0	3.0	4.0	4.0	5.0	5.0	5.0	4.0	5.0	5.0
30	4.0	5.0	4.0	5.0	5.0	5.0	4.0	5.0	4.0	4.0	5.0	4.0	5.0	5.0	3.0
73	5.0	4.0	4.5	5.0	5.0	5.0	4.0	5.0	3.0	4.5	4.0	5.0	5.0	5.0	4.5
212	3.0	5.0	4.0	4.0	4.5	4.0	3.0	5.0	3.0	4.0	NaN	NaN	3.0	3.0	5.0
213	3.0	2.5	5.0	NaN	NaN	2.0	5.0	NaN	4.0	2.5	2.0	5.0	3.0	3.0	4.0
294	4.0	3.0	4.0	NaN	3.0	4.0	4.0	4.0	3.0	NaN	NaN	4.0	4.5	4.0	4.5
311	3.0	3.0	4.0	3.0	4.5	5.0	4.5	5.0	4.5	2.0	4.0	3.0	4.5	4.5	4.0
380	4.0	5.0	4.0	5.0	4.0	5.0	4.0	NaN	4.0	5.0	4.0	4.0	NaN	3.0	5.0
452	3.5	4.0	4.0	5.0	5.0	4.0	5.0	4.0	4.0	5.0	5.0	4.0	4.0	4.0	2.0
468	4.0	3.0	3.5	3.5	3.5	3.0	2.5	NaN	NaN	3.0	4.0	3.0	3.5	3.0	3.0
509	3.0	5.0	5.0	5.0	4.0	4.0	3.0	5.0	2.0	4.0	4.5	5.0	5.0	3.0	4.5
547	3.5	NaN	NaN	5.0	5.0	2.0	3.0	5.0	NaN	5.0	5.0	2.5	2.0	3.5	3.5
564	4.0	1.0	2.0	5.0	NaN	3.0	5.0	4.0	5.0	5.0	5.0	5.0	5.0	3.0	3.0
580	4.0	4.5	4.0	4.5	4.0	3.5	3.0	4.0	4.5	4.0	4.5	4.0	3.5	3.0	4.5
624	5.0	NaN	5.0	5.0	NaN	3.0	3.0	NaN	3.0	5.0	4.0	5.0	5.0	5.0	2.0

To build our first collaborative filtering model, we need to take care of a few things first. The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential integers starting at zero to use for modeling (you'll see why later). We can use scikit-learn's LabelEncoder class to transform the fields. We'll also create variables holding the total number of unique users and movies, as well as the min and max ratings present in the data, for reasons that will become apparent shortly.

user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()

item_enc = LabelEncoder()
ratings['movie'] = item_enc.fit_transform(ratings['movieId'].values)
n_movies = ratings['movie'].nunique()

ratings['rating'] = ratings['rating'].values.astype(np.float32)
min_rating = min(ratings['rating'])
max_rating = max(ratings['rating'])

n_users, n_movies, min_rating, max_rating
(671, 9066, 0.5, 5.0)

Create a traditional (X, y) pairing of data and label, then split the data into training and test data sets.

X = ratings[['user', 'movie']].values
y = ratings['rating'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
((90003, 2), (10001, 2), (90003,), (10001,))

Another constant we'll need for the model is the number of latent factors per user/movie. This number can be whatever we want; however, for the collaborative filtering model it does need to be the same for both users and movies, otherwise the dot product between them wouldn't be defined. When Jeremy covered this in his class, he said he'd played around with different numbers and 50 seemed to work best, so we'll go with that.

Finally, we need to turn the users and movies into separate arrays in the training and test data. This is because in Keras they'll each be defined as distinct inputs, and Keras expects each input to be fed in as its own array.

n_factors = 50

X_train_array = [X_train[:, 0], X_train[:, 1]]
X_test_array = [X_test[:, 0], X_test[:, 1]]

Now we get to the model itself. The main idea here is that we're going to use embeddings to represent each user and each movie in the data. These embeddings will be vectors (of size n_factors) that start out as random numbers but are fit by the model to capture the essential qualities of each user/movie. To get a predicted rating, we simply compute the dot product between a user vector and a movie vector. The code is fairly simple; there isn't even a traditional neural network layer or activation involved. I stuck some regularization on the embedding layers and used a different initializer, but even that probably isn't necessary. Notice that this is where we need the number of unique users and movies, since those counts define the size of each embedding matrix.

from keras.models import Model
from keras.layers import Input, Reshape, Dot
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.regularizers import l2

def RecommenderV1(n_users, n_movies, n_factors):
    user = Input(shape=(1,))
    u = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    u = Reshape((n_factors,))(u)
    
    movie = Input(shape=(1,))
    m = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    m = Reshape((n_factors,))(m)
    
    x = Dot(axes=1)([u, m])

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model

This is kind of a neat example of how flexible and powerful modern computation frameworks like Keras and PyTorch are. Even though they're billed as deep learning libraries, they have the building blocks to quickly create any computation graph you want and get automatic differentiation essentially for free. Below you can see that all of the parameters are in the embedding layers; we don't have any traditional neural net components at all.

model = RecommenderV1(n_users, n_movies, n_factors)
model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 1, 50)        33550       input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 1, 50)        453300      input_2[0][0]                    
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 50)           0           embedding_1[0][0]                
__________________________________________________________________________________________________
reshape_2 (Reshape)             (None, 50)           0           embedding_2[0][0]                
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 1)            0           reshape_1[0][0]                  
                                                                 reshape_2[0][0]                  
==================================================================================================
Total params: 486,850
Trainable params: 486,850
Non-trainable params: 0
__________________________________________________________________________________________________

Let's go ahead and train this for a few epochs and see what we get.

history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))
Train on 90003 samples, validate on 10001 samples
Epoch 1/5
90003/90003 [==============================] - 6s 66us/step - loss: 9.7935 - val_loss: 3.4641
Epoch 2/5
90003/90003 [==============================] - 4s 49us/step - loss: 2.0427 - val_loss: 1.6521
Epoch 3/5
90003/90003 [==============================] - 4s 49us/step - loss: 1.1574 - val_loss: 1.3535
Epoch 4/5
90003/90003 [==============================] - 4s 48us/step - loss: 0.9027 - val_loss: 1.2607
Epoch 5/5
90003/90003 [==============================] - 4s 48us/step - loss: 0.7786 - val_loss: 1.2209
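
As a quick sanity check (this bit isn't from the lesson, just a convenient habit): since we're optimizing mean squared error, the square root of the validation loss gives an RMSE in rating units, and we can eyeball a few held-out predictions against the actual ratings.

preds = model.predict(X_test_array).flatten()
print(np.sqrt(np.mean((preds - y_test) ** 2)))    # RMSE on the test set
print(list(zip(preds[:5].round(2), y_test[:5])))  # predicted vs. actual ratings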

Not bad for a first try. We can make some improvements though. The first thing we can do is add a "bias" to each embedding. The concept is similar to the bias in a fully-connected layer or the intercept in a linear model. It just provides an extra degree of freedom. We can implement this idea using new embedding layers with a vector length of one. The bias embeddings get added to the result of the dot product.

The second improvement we can make is running the output of the dot product through a sigmoid layer and then scaling the result using the min and max ratings in the data. This is a neat technique that introduces a non-linearity into the output and results in a modest performance bump. Since the ratings in this data set run from 0.5 to 5.0, the scaled output is sigmoid(x) * 4.5 + 0.5, which keeps every prediction inside the valid rating range.

I also refactored the code a bit by pulling out the embedding layer and reshape operation into a separate class.

from keras.layers import Add, Activation, Lambda

class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors
    
    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x

def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)
    
    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model

The model summary shows the new graph. Notice the additional embedding layers with parameter numbers equal to the unique user and movie counts.

model = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_3 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 1, 50)        33550       input_3[0][0]                    
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 1, 50)        453300      input_4[0][0]                    
__________________________________________________________________________________________________
reshape_3 (Reshape)             (None, 50)           0           embedding_3[0][0]                
__________________________________________________________________________________________________
reshape_5 (Reshape)             (None, 50)           0           embedding_5[0][0]                
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 1, 1)         671         input_3[0][0]                    
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, 1, 1)         9066        input_4[0][0]                    
__________________________________________________________________________________________________
dot_2 (Dot)                     (None, 1)            0           reshape_3[0][0]                  
                                                                 reshape_5[0][0]                  
__________________________________________________________________________________________________
reshape_4 (Reshape)             (None, 1)            0           embedding_4[0][0]                
__________________________________________________________________________________________________
reshape_6 (Reshape)             (None, 1)            0           embedding_6[0][0]                
__________________________________________________________________________________________________
add_1 (Add)                     (None, 1)            0           dot_2[0][0]                      
                                                                 reshape_4[0][0]                  
                                                                 reshape_6[0][0]                  
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 1)            0           add_1[0][0]                      
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 1)            0           activation_1[0][0]               
==================================================================================================
Total params: 496,587
Trainable params: 496,587
Non-trainable params: 0
__________________________________________________________________________________________________
history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))
Train on 90003 samples, validate on 10001 samples
Epoch 1/5
90003/90003 [==============================] - 6s 64us/step - loss: 1.2850 - val_loss: 0.9083
Epoch 2/5
90003/90003 [==============================] - 5s 57us/step - loss: 0.7445 - val_loss: 0.7801
Epoch 3/5
90003/90003 [==============================] - 5s 57us/step - loss: 0.5615 - val_loss: 0.7646
Epoch 4/5
90003/90003 [==============================] - 5s 57us/step - loss: 0.4273 - val_loss: 0.7669
Epoch 5/5
90003/90003 [==============================] - 5s 58us/step - loss: 0.3298 - val_loss: 0.7823

Those two additions to the model resulted in a pretty sizable improvement. Validation error is now down to ~0.76, which is about as good as what Jeremy got (and, I believe, close to SOTA for this data set).
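
As a quick aside (this part isn't from the fast.ai lesson), turning the trained model into actual recommendations is straightforward: score every movie a user hasn't rated and keep the highest predictions. A minimal sketch, using an arbitrarily chosen user index of 0, might look like this:

user_id = 0
seen = ratings.loc[ratings['user'] == user_id, 'movie'].values
candidates = np.setdiff1d(np.arange(n_movies), seen)

# Predict a rating for every unseen movie, then keep the 10 highest
preds = model.predict([np.full_like(candidates, user_id), candidates]).flatten()
top = candidates[np.argsort(preds)[::-1][:10]]

# Map the encoded movie indices back to movieIds and then to titles
top_ids = item_enc.inverse_transform(top)
print(movies[movies['movieId'].isin(top_ids)]['title'].values)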

That pretty much covers the conventional approach to solving this problem, but there's another way we can tackle it. Instead of taking the dot product of the embedding vectors, what if we just concatenated the embeddings together and stuck a fully-connected layer on top of them? It's still not technically "deep" but it would at least be a neural network! To modify the code, we can remove the bias embeddings from V2 and concatenate the embedding layers instead. Then we can add some dropout, insert a dense layer, and stick some dropout on the dense layer as well. Finally, we'll run it through a single-unit dense layer and keep the sigmoid scaling trick at the end.

from keras.layers import Concatenate, Dense, Dropout

def RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    
    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    
    x = Concatenate()([u, m])
    x = Dropout(0.05)(x)
    
    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)
    
    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model

Most of the parameters are still in the embedding layers, but we have some added learning capability from the dense layers.

model = RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_5 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 1, 50)        33550       input_5[0][0]                    
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, 1, 50)        453300      input_6[0][0]                    
__________________________________________________________________________________________________
reshape_7 (Reshape)             (None, 50)           0           embedding_7[0][0]                
__________________________________________________________________________________________________
reshape_8 (Reshape)             (None, 50)           0           embedding_8[0][0]                
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 100)          0           reshape_7[0][0]                  
                                                                 reshape_8[0][0]                  
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 100)          0           concatenate_1[0][0]              
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 10)           1010        dropout_1[0][0]                  
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 10)           0           dense_1[0][0]                    
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 10)           0           activation_2[0][0]               
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            11          dropout_2[0][0]                  
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 1)            0           dense_2[0][0]                    
__________________________________________________________________________________________________
lambda_2 (Lambda)               (None, 1)            0           activation_3[0][0]               
==================================================================================================
Total params: 487,871
Trainable params: 487,871
Non-trainable params: 0
__________________________________________________________________________________________________
history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))
Train on 90003 samples, validate on 10001 samples
Epoch 1/5
90003/90003 [==============================] - 6s 71us/step - loss: 0.9461 - val_loss: 0.8079
Epoch 2/5
90003/90003 [==============================] - 6s 64us/step - loss: 0.8097 - val_loss: 0.7898
Epoch 3/5
90003/90003 [==============================] - 6s 63us/step - loss: 0.7781 - val_loss: 0.7855
Epoch 4/5
90003/90003 [==============================] - 6s 64us/step - loss: 0.7617 - val_loss: 0.7820
Epoch 5/5
90003/90003 [==============================] - 6s 63us/step - loss: 0.7513 - val_loss: 0.7858

Without doing any tuning at all, we still managed to get a result that's pretty close to the best performance we saw with the traditional approach. This technique has the added benefit that we can easily incorporate additional features into the model. For instance, we could create some date features from the timestamp or throw in the movie genres as a new embedding layer (a rough sketch of that idea is below). We could also tune the sizes of the movie and user embeddings independently, since they no longer need to match. Lots of possibilities here.
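
To make the genre idea concrete, here's a sketch of what that might look like. I haven't trained this version, and the integer-encoded genre input plus the n_genres count are hypothetical; you'd build them the same way we built the user and movie encodings above. But it shows how naturally a third embedding slots into the concatenation approach, and how the embedding sizes no longer have to agree.

def RecommenderNetWithGenre(n_users, n_movies, n_genres, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)

    # Hypothetical third input: an integer-encoded primary genre for each movie
    genre = Input(shape=(1,))
    g = EmbeddingLayer(n_genres, 5)(genre)  # this embedding can be a different size

    x = Concatenate()([u, m, g])
    x = Dropout(0.05)(x)
    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie, genre], outputs=x)
    model.compile(loss='mean_squared_error', optimizer=Adam(lr=0.001))
    return model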

Follow me on Twitter to get new post updates.


