# Deep Learning With Keras: Recommender Systems

29th April 2019

In this post we'll continue the series on deep learning by using the popular Keras framework to build a recommender system. This use case gets far less attention in the deep learning literature than image classifiers or text generators, but it is arguably an even more common problem in practice. In fact, as you'll see below, it's debatable whether this topic even qualifies as "deep learning" because we're going to see how to build a pretty good recommender system without using a neural network at all! We will, however, take advantage of the power of a modern computation framework like Keras to implement the recommender with minimal code. We'll try a couple of different approaches using a technique called collaborative filtering. Finally, we'll build a true neural network and see how it compares to the collaborative filtering approach.

The data used for this task is the MovieLens data set. As with the previous posts, much of this content is originally based on Jeremy Howard's excellent fast.ai lessons.

I've already saved the zip file to a local directory so we can get started with some imports and reading in the ratings.csv file, which is where the data for this task comes from.

```
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
PATH = '/home/paperspace/data/ml-latest-small/'
ratings = pd.read_csv(PATH + 'ratings.csv')
ratings.head()
```

| userId | movieId | rating | timestamp |
| --- | --- | --- | --- |
| 1 | 31 | 2.5 | 1260759144 |
| 1 | 1029 | 3.0 | 1260759179 |
| 1 | 1061 | 3.0 | 1260759182 |
| 1 | 1129 | 2.0 | 1260759185 |
| 1 | 1172 | 4.0 | 1260759205 |

The data is tabular and consists of a user ID, a movie ID, and a rating (there's also a timestamp but we won't use it for this task). Our task is to predict the rating for a user/movie pair, with the idea that if we had a model that's good at this task then we could predict how a user would rate movies they haven't seen yet and recommend movies with the highest predicted rating.

The zip file also includes a listing of movies and their associated genres. We don't actually need this for the model but it's useful to know about.

```
movies = pd.read_csv(PATH + 'movies.csv')
movies.head()
```

| movieId | title | genres |
| --- | --- | --- |
| 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 5 | Father of the Bride Part II (1995) | Comedy |

To get a better sense of what the data looks like, we can turn it into a table by selecting the top 15 users/movies from the data and joining them together. The result shows how each of the top users rated each of the top movies.

```
g = ratings.groupby('userId')['rating'].count()
top_users = g.sort_values(ascending=False)[:15]
g = ratings.groupby('movieId')['rating'].count()
top_movies = g.sort_values(ascending=False)[:15]
top_r = ratings.join(top_users, rsuffix='_r', how='inner', on='userId')
top_r = top_r.join(top_movies, rsuffix='_r', how='inner', on='movieId')
pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum)
```

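| userId / movieId | 1 | 110 | 260 | 296 | 318 | 356 | 480 | 527 | 589 | 593 | 608 | 1196 | 1198 | 1270 | 2571 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 15 | 2.0 | 3.0 | 5.0 | 5.0 | 2.0 | 1.0 | 3.0 | 4.0 | 4.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 |
| 30 | 4.0 | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 5.0 | 5.0 | 3.0 |
| 73 | 5.0 | 4.0 | 4.5 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 3.0 | 4.5 | 4.0 | 5.0 | 5.0 | 5.0 | 4.5 |
| 212 | 3.0 | 5.0 | 4.0 | 4.0 | 4.5 | 4.0 | 3.0 | 5.0 | 3.0 | 4.0 | NaN | NaN | 3.0 | 3.0 | 5.0 |
| 213 | 3.0 | 2.5 | 5.0 | NaN | NaN | 2.0 | 5.0 | NaN | 4.0 | 2.5 | 2.0 | 5.0 | 3.0 | 3.0 | 4.0 |
| 294 | 4.0 | 3.0 | 4.0 | NaN | 3.0 | 4.0 | 4.0 | 4.0 | 3.0 | NaN | NaN | 4.0 | 4.5 | 4.0 | 4.5 |
| 311 | 3.0 | 3.0 | 4.0 | 3.0 | 4.5 | 5.0 | 4.5 | 5.0 | 4.5 | 2.0 | 4.0 | 3.0 | 4.5 | 4.5 | 4.0 |
| 380 | 4.0 | 5.0 | 4.0 | 5.0 | 4.0 | 5.0 | 4.0 | NaN | 4.0 | 5.0 | 4.0 | 4.0 | NaN | 3.0 | 5.0 |
| 452 | 3.5 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 5.0 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 4.0 | 4.0 | 2.0 |
| 468 | 4.0 | 3.0 | 3.5 | 3.5 | 3.5 | 3.0 | 2.5 | NaN | NaN | 3.0 | 4.0 | 3.0 | 3.5 | 3.0 | 3.0 |
| 509 | 3.0 | 5.0 | 5.0 | 5.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 | 4.0 | 4.5 | 5.0 | 5.0 | 3.0 | 4.5 |
| 547 | 3.5 | NaN | NaN | 5.0 | 5.0 | 2.0 | 3.0 | 5.0 | NaN | 5.0 | 5.0 | 2.5 | 2.0 | 3.5 | 3.5 |
| 564 | 4.0 | 1.0 | 2.0 | 5.0 | NaN | 3.0 | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 3.0 |
| 580 | 4.0 | 4.5 | 4.0 | 4.5 | 4.0 | 3.5 | 3.0 | 4.0 | 4.5 | 4.0 | 4.5 | 4.0 | 3.5 | 3.0 | 4.5 |
| 624 | 5.0 | NaN | 5.0 | 5.0 | NaN | 3.0 | 3.0 | NaN | 3.0 | 5.0 | 4.0 | 5.0 | 5.0 | 5.0 | 2.0 |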

To build our first collaborative filtering model, we need to take care of a few things first. The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential starting at zero to use for modeling (you'll see why later). We can use scikit-learn's LabelEncoder class to transform the fields. We'll also create variables with the total number of unique users and movies in the data, as well as the min and max ratings present in the data, for reasons that will become apparent shortly.

```
user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()
item_enc = LabelEncoder()
ratings['movie'] = item_enc.fit_transform(ratings['movieId'].values)
n_movies = ratings['movie'].nunique()
ratings['rating'] = ratings['rating'].values.astype(np.float32)
min_rating = min(ratings['rating'])
max_rating = max(ratings['rating'])
n_users, n_movies, min_rating, max_rating
```

(671, 9066, 0.5, 5.0)
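To make the re-indexing concrete, here is a small sketch on made-up IDs (the `raw_ids` array is hypothetical, not taken from MovieLens). It uses `np.unique`, which yields the same sorted-order mapping that `LabelEncoder.fit_transform` produces, and shows that the contiguous indices can be mapped back to the original IDs.

```python
import numpy as np

# Hypothetical, non-sequential IDs standing in for userId / movieId values.
raw_ids = np.array([31, 1029, 31, 1172, 1061, 1029])

# classes holds the sorted unique IDs; encoded holds each ID's position
# in classes, i.e. a contiguous index starting at zero.
classes, encoded = np.unique(raw_ids, return_inverse=True)

print(classes.tolist())   # [31, 1029, 1061, 1172]
print(encoded.tolist())   # [0, 1, 0, 3, 2, 1]

# Indexing back into classes recovers the original IDs.
decoded = classes[encoded]
print(decoded.tolist())   # [31, 1029, 31, 1172, 1061, 1029]
```

With the fitted encoders from the post, `user_enc.inverse_transform` and `item_enc.inverse_transform` play the role of the `classes[encoded]` lookup.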

Create a traditional (X, y) pairing of data and label, then split the data into training and test data sets.

```
X = ratings[['user', 'movie']].values
y = ratings['rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```

((90003, 2), (10001, 2), (90003,), (10001,))

Another constant we'll need for the model is the number of factors per user/movie. This number can be whatever we want; however, for the collaborative filtering model it does need to be the same for both users and movies, since we take a dot product between the two vectors. When Jeremy covered this in his class, he said he played around with different numbers and 50 seemed to work best, so we'll go with that.

Finally, we need to turn users and movies into separate arrays in the training and test data. This is because in Keras they'll each be defined as distinct inputs, and the way Keras works is each input needs to be fed in as its own array.

```
n_factors = 50
X_train_array = [X_train[:, 0], X_train[:, 1]]
X_test_array = [X_test[:, 0], X_test[:, 1]]
```

Now we get to the model itself. The main idea here is we're going to use embeddings to represent each user and each movie in the data. These embeddings will be vectors (of size n_factors) that start out as random numbers but are fit by the model to capture the essential qualities of each user/movie. The predicted rating for a user/movie pair is simply the dot product of the user vector and the movie vector. The code is fairly simple; there isn't even a traditional neural network layer or activation involved. I stuck some regularization on the embedding layers and used a different initializer, but even that probably isn't necessary. Notice that this is where we need the number of unique users and movies, since those define the number of rows in each embedding matrix. It's also why the IDs had to be sequential and zero-based: each ID is used as a row index into its embedding matrix.

```
from keras.models import Model
from keras.layers import Input, Reshape, Dot, Embedding
from keras.optimizers import Adam
from keras.regularizers import l2

def RecommenderV1(n_users, n_movies, n_factors):
    # One embedding vector of length n_factors per user
    user = Input(shape=(1,))
    u = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    u = Reshape((n_factors,))(u)

    # One embedding vector of length n_factors per movie
    movie = Input(shape=(1,))
    m = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    m = Reshape((n_factors,))(m)

    # Predicted rating is the dot product of the two vectors
    x = Dot(axes=1)([u, m])

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)  # newer Keras releases name this argument learning_rate
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model
```
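Stripped of the Keras machinery, the forward pass of this model is just row lookups followed by a row-wise dot product. The NumPy sketch below illustrates that computation with toy sizes and random, untrained embeddings (all values here are made up for illustration); it is not the Keras implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 6, 3   # toy sizes, not the real data set

# Each row is one entity's embedding vector. In the real model these start
# out random and are adjusted by gradient descent during training.
user_emb = rng.normal(size=(n_users, n_factors))
movie_emb = rng.normal(size=(n_movies, n_factors))

# A batch of (user index, movie index) pairs, like the rows of X_train.
users = np.array([0, 2, 3])
movies = np.array([5, 1, 1])

# An embedding lookup is plain row indexing, which is why the IDs
# need to run from 0 to n - 1.
u = user_emb[users]      # shape (3, n_factors)
m = movie_emb[movies]    # shape (3, n_factors)

# Row-wise dot product: one predicted rating per (user, movie) pair.
pred = np.sum(u * m, axis=1)
print(pred.shape)        # (3,)
```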

This is kind of a neat example of how flexible and powerful modern computation frameworks like Keras and PyTorch are. Even though these are billed as deep learning libraries, they have the building blocks to quickly create any computation graph you want and get automatic differentiation essentially for free. Below you can see that all of the parameters are in the embedding layers; we don't have any traditional neural net components at all.

```
model = RecommenderV1(n_users, n_movies, n_factors)
model.summary()
```

| Layer (type) | Output Shape | Param # | Connected to |
| --- | --- | --- | --- |
| input_1 (InputLayer) | (None, 1) | 0 | |
| input_2 (InputLayer) | (None, 1) | 0 | |
| embedding_1 (Embedding) | (None, 1, 50) | 33550 | input_1[0][0] |
| embedding_2 (Embedding) | (None, 1, 50) | 453300 | input_2[0][0] |
| reshape_1 (Reshape) | (None, 50) | 0 | embedding_1[0][0] |
| reshape_2 (Reshape) | (None, 50) | 0 | embedding_2[0][0] |
| dot_1 (Dot) | (None, 1) | 0 | reshape_1[0][0], reshape_2[0][0] |

Total params: 486,850; trainable params: 486,850; non-trainable params: 0
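The parameter counts in the summary follow directly from the embedding shapes: each embedding matrix has one row per entity and `n_factors` columns. A quick check of the arithmetic, using the counts reported earlier in the post:

```python
n_users, n_movies, n_factors = 671, 9066, 50

user_params = n_users * n_factors        # rows x columns of the user matrix
movie_params = n_movies * n_factors      # rows x columns of the movie matrix
total_params = user_params + movie_params

print(user_params, movie_params, total_params)   # 33550 453300 486850
```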

Let's go ahead and train this for a few epochs and see what we get.

```
history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
verbose=1, validation_data=(X_test_array, y_test))
```

Train on 90003 samples, validate on 10001 samples

| Epoch | Time | loss | val_loss |
| --- | --- | --- | --- |
| 1/5 | 6s (66us/step) | 9.7935 | 3.4641 |
| 2/5 | 4s (49us/step) | 2.0427 | 1.6521 |
| 3/5 | 4s (49us/step) | 1.1574 | 1.3535 |
| 4/5 | 4s (48us/step) | 0.9027 | 1.2607 |
| 5/5 | 4s (48us/step) | 0.7786 | 1.2209 |

Not bad for a first try. We can make some improvements though. The first thing we can do is add a "bias" to each embedding. The concept is similar to the bias in a fully-connected layer or the intercept in a linear model. It just provides an extra degree of freedom. We can implement this idea using new embedding layers with a vector length of one. The bias embeddings get added to the result of the dot product.

The second improvement we can make is running the result (the dot product plus the two biases) through a sigmoid and then scaling it using the min and max ratings in the data. This is a neat technique that introduces a non-linearity into the output and keeps every prediction inside the valid rating range, and it results in a modest performance bump.

I also refactored the code a bit by pulling out the embedding layer and reshape operation into a separate class.

```
from keras.layers import Add, Activation, Lambda

class EmbeddingLayer:
    """Embedding lookup followed by a reshape to a flat (n_factors,) vector."""

    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors

    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x

def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)      # per-user bias

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)    # per-movie bias

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    # Squash to (0, 1), then rescale to the rating range
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)  # newer Keras releases name this argument learning_rate
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model
```
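To see what the sigmoid-and-scale step does to a raw score, here is a small NumPy sketch using the min and max ratings found earlier (0.5 and 5.0). The raw scores and bias values are made-up numbers for illustration.

```python
import numpy as np

min_rating, max_rating = 0.5, 5.0

def scaled_sigmoid(x):
    """Squash x into (0, 1), then stretch it to (min_rating, max_rating)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (max_rating - min_rating) + min_rating

# Raw score = dot product + user bias + movie bias (toy numbers).
dot, user_bias, movie_bias = 1.2, -0.3, 0.4
raw = np.array([-10.0, 0.0, dot + user_bias + movie_bias, 10.0])

scaled = scaled_sigmoid(raw)
print(scaled)
```

However large or small the raw score gets, the output stays between 0.5 and 5.0, and a raw score of zero maps to the midpoint, 2.75. One consequence worth keeping in mind is that the output only approaches the two endpoints asymptotically.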

The model summary shows the new graph. Notice the additional embedding layers with parameter numbers equal to the unique user and movie counts.

```
model = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()
```

| Layer (type) | Output Shape | Param # | Connected to |
| --- | --- | --- | --- |
| input_3 (InputLayer) | (None, 1) | 0 | |
| input_4 (InputLayer) | (None, 1) | 0 | |
| embedding_3 (Embedding) | (None, 1, 50) | 33550 | input_3[0][0] |
| embedding_5 (Embedding) | (None, 1, 50) | 453300 | input_4[0][0] |
| reshape_3 (Reshape) | (None, 50) | 0 | embedding_3[0][0] |
| reshape_5 (Reshape) | (None, 50) | 0 | embedding_5[0][0] |
| embedding_4 (Embedding) | (None, 1, 1) | 671 | input_3[0][0] |
| embedding_6 (Embedding) | (None, 1, 1) | 9066 | input_4[0][0] |
| dot_2 (Dot) | (None, 1) | 0 | reshape_3[0][0], reshape_5[0][0] |
| reshape_4 (Reshape) | (None, 1) | 0 | embedding_4[0][0] |
| reshape_6 (Reshape) | (None, 1) | 0 | embedding_6[0][0] |
| add_1 (Add) | (None, 1) | 0 | dot_2[0][0], reshape_4[0][0], reshape_6[0][0] |
| activation_1 (Activation) | (None, 1) | 0 | add_1[0][0] |
| lambda_1 (Lambda) | (None, 1) | 0 | activation_1[0][0] |

Total params: 496,587; trainable params: 496,587; non-trainable params: 0

```
history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
verbose=1, validation_data=(X_test_array, y_test))
```

Train on 90003 samples, validate on 10001 samples

| Epoch | Time | loss | val_loss |
| --- | --- | --- | --- |
| 1/5 | 6s (64us/step) | 1.2850 | 0.9083 |
| 2/5 | 5s (57us/step) | 0.7445 | 0.7801 |
| 3/5 | 5s (57us/step) | 0.5615 | 0.7646 |
| 4/5 | 5s (57us/step) | 0.4273 | 0.7669 |
| 5/5 | 5s (58us/step) | 0.3298 | 0.7823 |

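Those two additions to the model resulted in a pretty sizable improvement. Validation loss (mean squared error) is now down to ~0.76, which is about as good as what Jeremy got (and, I believe, close to SOTA for this data set). Note that the best value comes at epoch 3; after that the training loss keeps falling while the validation loss creeps back up, so the model is starting to overfit.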
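Since the loss is mean squared error, taking the square root puts it back in rating units (stars), which makes the two models easier to compare. The snippet below does that for the lowest validation losses in the two training logs above; the dictionary keys are just labels chosen for this example.

```python
import math

# Lowest validation MSE per model, copied from the training logs above.
val_mse = {"RecommenderV1": 1.2209, "RecommenderV2": 0.7646}

# RMSE is the square root of MSE and is expressed in rating units.
val_rmse = {name: math.sqrt(mse) for name, mse in val_mse.items()}

for name, rmse in val_rmse.items():
    print(f"{name}: MSE={val_mse[name]:.4f}  RMSE={rmse:.3f}")
```

In other words, the typical error on held-out ratings drops from roughly 1.10 stars to roughly 0.87 stars.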

That pretty much covers the conventional approach to solving this problem, but there's another way we can tackle this. Instead of taking the dot product of the embedding vectors, what if we just concatenated the embeddings together and stuck a fully-connected layer on top of them? It's still not technically "deep" but it would at least be a neural network! To modify the code, we can remove the bias embeddings from V2 and do a concat on the embedding layers instead. Then we can add some dropout, insert a dense layer, and stick some dropout on the dense layer as well. Finally, we'll run it through a single-unit dense layer to keep the sigmoid trick at the end.

```
from keras.layers import Concatenate, Dense, Dropout

def RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)

    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)

    # Concatenate the embeddings instead of taking their dot product
    x = Concatenate()([u, m])
    x = Dropout(0.05)(x)

    # Single hidden layer
    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)

    # Single-unit output, squashed and rescaled to the rating range
    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)  # newer Keras releases name this argument learning_rate
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model
```

Most of the parameters are still in the embedding layers, but we have some added learning capability from the dense layers.

```
model = RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()
```

| Layer (type) | Output Shape | Param # | Connected to |
| --- | --- | --- | --- |
| input_5 (InputLayer) | (None, 1) | 0 | |
| input_6 (InputLayer) | (None, 1) | 0 | |
| embedding_7 (Embedding) | (None, 1, 50) | 33550 | input_5[0][0] |
| embedding_8 (Embedding) | (None, 1, 50) | 453300 | input_6[0][0] |
| reshape_7 (Reshape) | (None, 50) | 0 | embedding_7[0][0] |
| reshape_8 (Reshape) | (None, 50) | 0 | embedding_8[0][0] |
| concatenate_1 (Concatenate) | (None, 100) | 0 | reshape_7[0][0], reshape_8[0][0] |
| dropout_1 (Dropout) | (None, 100) | 0 | concatenate_1[0][0] |
| dense_1 (Dense) | (None, 10) | 1010 | dropout_1[0][0] |
| activation_2 (Activation) | (None, 10) | 0 | dense_1[0][0] |
| dropout_2 (Dropout) | (None, 10) | 0 | activation_2[0][0] |
| dense_2 (Dense) | (None, 1) | 11 | dropout_2[0][0] |
| activation_3 (Activation) | (None, 1) | 0 | dense_2[0][0] |
| lambda_2 (Lambda) | (None, 1) | 0 | activation_3[0][0] |

Total params: 487,871; trainable params: 487,871; non-trainable params: 0
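The dense-layer parameter counts in the summary (1010 and 11) come from the weight shapes: a kernel matrix plus one bias per output unit. The NumPy sketch below walks through the same forward pass at inference time, when dropout is inactive, using random stand-in weights rather than trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, hidden = 50, 10
min_rating, max_rating = 0.5, 5.0

# Stand-ins for one user's and one movie's learned embedding vectors.
u = rng.normal(size=n_factors)
m = rng.normal(size=n_factors)

# Dense layer weights: a kernel matrix plus one bias per output unit.
W1, b1 = rng.normal(size=(2 * n_factors, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, 1)) * 0.1, np.zeros(1)

x = np.concatenate([u, m])            # shape (100,)
h = np.maximum(x @ W1 + b1, 0.0)      # Dense(10) followed by ReLU
z = h @ W2 + b2                       # Dense(1)
rating = 1 / (1 + np.exp(-z)) * (max_rating - min_rating) + min_rating

# Parameter counts for the two dense layers.
print(W1.size + b1.size, W2.size + b2.size)   # 1010 11
```

Nothing in this computation requires `u` and `m` to have the same length, since they are concatenated rather than multiplied together.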

```
history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
verbose=1, validation_data=(X_test_array, y_test))
```

Train on 90003 samples, validate on 10001 samples

| Epoch | Time | loss | val_loss |
| --- | --- | --- | --- |
| 1/5 | 6s (71us/step) | 0.9461 | 0.8079 |
| 2/5 | 6s (64us/step) | 0.8097 | 0.7898 |
| 3/5 | 6s (63us/step) | 0.7781 | 0.7855 |
| 4/5 | 6s (64us/step) | 0.7617 | 0.7820 |
| 5/5 | 6s (63us/step) | 0.7513 | 0.7858 |

Without doing any tuning at all we still managed to get a result (validation loss of ~0.78) that's pretty close to the best performance we saw with the traditional approach (~0.76). This technique has the added benefit that we can easily incorporate additional features into the model. For instance, we could create some date features from the timestamp or throw in the movie genres as a new embedding layer. We could also tune the size of the movie and user embeddings independently, since they no longer need to match. Lots of possibilities here.
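Coming back to the original goal of recommending movies a user hasn't seen yet, any of the models above could be turned into a recommender by scoring every unseen movie for a user and keeping the highest predictions. The sketch below shows one way to do that. `recommend_top_n` and `fake_predict` are hypothetical helpers written for this example, and the toy predictor stands in for a trained model.

```python
import numpy as np

def recommend_top_n(predict_fn, user_idx, n_movies, seen_movies, n=5):
    """Return the n unseen movie indices with the highest predicted rating.

    predict_fn is any callable that takes (user_indices, movie_indices)
    arrays and returns one predicted rating per pair.
    """
    seen = np.asarray(list(seen_movies), dtype=int)
    candidates = np.setdiff1d(np.arange(n_movies), seen)   # sorted, unseen only
    users = np.full(len(candidates), user_idx)
    scores = np.asarray(predict_fn(users, candidates)).ravel()
    # Stable sort keeps ties in ascending movie-index order.
    best = np.argsort(-scores, kind="stable")[:n]
    return candidates[best], scores[best]

# Toy stand-in for a trained model: movie i gets the score (i % 7) / 2 + 1.
def fake_predict(users, movies):
    return (movies % 7) / 2 + 1

top_movies, top_scores = recommend_top_n(
    fake_predict, user_idx=0, n_movies=20, seen_movies={6, 13}, n=3)
print(top_movies.tolist(), top_scores.tolist())   # [5, 12, 19] [3.5, 3.5, 3.5]
```

With a trained Keras model, `predict_fn` could be something like `lambda u, m: model.predict([u, m])`, and `item_enc.inverse_transform(top_movies)` would map the indices back to the original MovieLens movie IDs.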