Curious Insight

The Book Of Five Rings

John Wittenauer — Mon, 23 Mar 2020 00:12:44 GMT

I'm a big believer in an idea that Nassim Taleb popularized in his Incerto series called the Lindy effect. Put simply, it's a theory that the future life expectancy of some non-perishable thing is roughly proportional to its current age. The world has had smartphones for around 10 years so we can reasonably estimate them to be around for another 10 years or so before something replaces them. Conversely (and this really illustrates the point), the technology we call a "chair" has been around for thousands of years, so one should expect that chairs will be relevant for a long time to come. It's easy to think of scenarios where this rule of thumb fails, of course, but that's not the point. It's not a statistical prediction but rather an observation about the nature of things which exist in the world but do not age (an "indicator of robustness" as Taleb put it).

The Lindy effect applies to things like technologies, but also ideas, and by extension the modalities that we use to communicate ideas (i.e. books). I've been spending a lot more time lately thinking about the age of the books I'm reading. It's easy to focus on newer books because there are so many coming out, they are well-marketed, have catchy titles, get discussed a lot on podcasts etc. But if the Lindy effect holds, it means that most of the best books in history were written a long time ago. After incorporating this knowledge into my book selection strategy, I began seeking out old but enduring books that are still in print. Such is how I came to read Miyamoto Musashi's masterpiece, "The Book of Five Rings".

A typical biographical description of Musashi would begin by noting that he was a Japanese swordsman who lived in the 1600s. But he wasn't just any swordsman - he is arguably the greatest swordsman who ever lived. Musashi famously went undefeated in over 60 duels throughout his life (many of them to the death), a streak that no one else has ever come close to matching. His legend has been passed down through generations and remains deeply embedded in Japanese culture to this day.

In the later years of his life, Musashi wrote "The Book Of Five Rings" to codify the two-sword martial arts style he had spent his life mastering. Although the book is principally about the details of his study of martial arts, its true value is much broader and much deeper. Look beyond the surface and there's a life philosophy (which Musashi calls "The Way") that one can gain insights from that apply to all aspects of our existence.

In the book, the five rings correspond to the five "books" or sections of the text which are meant to refer to the idea that there are different elements in battle, just as there are different physical elements in life. Musashi named them Earth, Water, Fire, Wind, and Emptiness. Below are some of my notes from each section. One could describe the language Musashi uses as cryptic and hard to interpret, but to me that's sort of the point. As I read through the book, I couldn't help but feel as though every statement has a deeper meaning that would only be revealed to me upon careful, deliberate reflection. This process is still ongoing.

While I tried to capture the essence of the text in very short, concise passages, there is inevitably a great deal that was missed. Consider these notes a starting point to uncovering the wisdom embedded in Musashi's work.

Earth

The “Way” of something is a learned discipline or philosophy (i.e. the Way of Buddhism, the Way of the Carpenter). The Way of Martial Arts, which Musashi refers to as “Two Heavens, One Style”, is to learn skills that are useful in all things. The Way of the Martial Arts is a mastery of one’s craft similar to carpentry, of which the sword is the essential martial art. There is a rhythm to everything. There is rhythm in the formless. Victory is in knowing the rhythm of your opponent, in using a rhythm that is hard to grasp, and in developing a rhythm of emptiness rather than wisdom.

Water

Think deeply about the principles written in the book as though you discovered them yourself. Make them part of yourself. The mind should be centered, swaying peacefully. Be watchful of the mind and do not let it become clouded. Sharpen your wisdom. Learn the good and bad of all things. With every grip, stance, strike, do not think of the action itself. Think only about cutting down your opponent. With practice you will gradually grasp the principle of the Way.

Fire

There are three initiatives to understand in order to defeat an opponent – Initiative of Attack, Initiative of Waiting, and the Body-Body Initiative. Knowing the conditions in which you find yourself means clearly observing your opponent and grasping the way to victory with certainty. Become your opponent. Move the shadow. Control the light. Impose fear. Cause confusion. Do not use the same tactic repeatedly. The true Way of swordsmanship is to fight with your opponent and win.

Wind

The True Way does not prefer a long or short sword, a forceful or weak stroke, specialize in a stance, or fix the eyes on a particular gaze. It is not fast or slow, prefer interior or exterior positions, or dictate how to move your feet. There is no “best” in any of these things. There is only seeing through to its virtues with the mind.

Emptiness

The heart of Emptiness is in the absence of anything with form and the inability to have knowledge thereof. Knowing the existent, you know the nonexistent. A warrior learns the way with certainty. He has no confusion in his mind and is never lazy. He polishes his mind and will, and sharpens the two eyes of broad observation and focused vision. He clears away the clouds of confusion. In Emptiness exists Good but no Evil. Wisdom is Existence. Principle is Existence. The Way is Existence. The Mind is Emptiness.

Deep Learning With Keras: Recurrent Networks

John Wittenauer — Sun, 25 Aug 2019 13:27:21 GMT

This post is the fourth in a series on deep learning using Keras. We've already looked at dense networks with category embeddings, convolutional networks, and recommender systems. For this installment we're going to use recurrent networks to create a character-level language model for text generation. We'll start with a simple fully-connected network and show how it can be used as an "unrolled" recurrent layer, then gradually build up from there until we have a model capable of generating semi-reasonable sounding text. Much of this content is based on Jeremy Howard's fast.ai lessons. However, we'll use Keras instead of PyTorch and build out all of the code from scratch rather than relying on the fast.ai library.

The text corpus we're using for this task is the collected works of the philosopher Nietzsche. The whole corpus can be found here. Let's start by loading the data into memory and taking a peek at the beginning of the text.

%matplotlib inline
import io
import numpy as np
import keras
from keras.utils.data_utils import get_file

path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

len(text)

text[:400]

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to truth, have been unskilled and unseemly methods for\nwinning a woman? certainly she has never allowed herself '

Now get the unique set of characters that appear in the text. This is our vocabulary.

chars = sorted(list(set(text)))
vocab_size = len(chars)
vocab_size

''.join(chars)

'\n !"\'(),-.0123456789:;=?[]_abcdefghijklmnopqrstuvwxyzäæéë'

Let's create a dictionary that maps each unique character to an integer, which is what we'll feed into the model. The actual integer used isn't important, it just has to be unique (here we just take the index from the "chars" list above). It's also useful to have a reverse mapping to get back to characters in order to do something with the model output. Finally, create a "mapped" corpus where each character in the data has been replaced with its corresponding integer.

char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}
idx = [char_indices[c] for c in text]

idx[:20]

[42, 44, 31, 32, 27, 29, 31, 0, 0, 0, 45, 47, 42, 42, 41, 45, 35, 40, 33, 1]

We can convert from integers back to characters using something like this.

''.join(indices_char[i] for i in idx[:100])

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all ph'

For our first attempt, we'll build a model that accepts a 3-character sequence as input and tries to predict the following character in the text. For simplicity, we can just manually create each character sequence. Start by creating lists that take every 3rd character, offset by some amount between 0 and 3.

cs = 3
c1 = [idx[i] for i in range(0, len(idx) - cs, cs)]
c2 = [idx[i + 1] for i in range(0, len(idx) - cs, cs)]
c3 = [idx[i + 2] for i in range(0, len(idx) - cs, cs)]
c4 = [idx[i + 3] for i in range(0, len(idx) - cs, cs)]

This just converts the lists to numpy arrays. Notice that this approach resulted in non-overlapping sequences, i.e. we use characters 0-2 to predict character 3, then characters 3-5 to predict character 6, etc. That's why the array shape is about 1/3 the size of the original text. We'll see how to improve on this later.

x1 = np.stack(c1)
x2 = np.stack(c2)
x3 = np.stack(c3)
y = np.stack(c4)

x1.shape, y.shape

((200297,), (200297,))

Our model will use embeddings to represent each character. This is why we converted them to integers before - each integer gets turned into a vector in the embedding layer. Set some variables for the embedding vector size and the number of hidden units to use in the model. Finally, we need to convert the target variable to a one-hot character encoding. This is because the model outputs a probability for each character, and in order to score this properly it needs to be able to compare that output with an array that's structured the same way.

n_factors = 42
n_hidden = 256
y_cat = keras.utils.to_categorical(y)

y_cat.shape

(200297, 56)
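It's worth noticing that the shape above has 56 columns even though there are 57 characters in the vocabulary. When no class count is passed, to_categorical sizes its output from the largest label it sees, so this is presumably because the highest-indexed character never shows up as a target in this particular slicing of the text. Here's a minimal numpy sketch of that behavior. The one_hot function and the toy labels are stand-ins made up for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Assumption: mimic the default of inferring the width from the data.
        num_classes = int(labels.max()) + 1
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

toy_targets = [2, 0, 3]              # hypothetical labels; class 4 never occurs
inferred = one_hot(toy_targets)      # width inferred from the largest label
explicit = one_hot(toy_targets, 5)   # width fixed to a 5-symbol vocabulary
print(inferred.shape, explicit.shape)
```

Passing the class count explicitly is one way to keep the target width tied to the vocabulary regardless of which characters happen to appear as targets.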

Now we get to the first iteration of our model. The way I've structured this is by defining the layers of the model so that they can be re-used across multiple inputs. For example, rather than create an embedding layer for each of the three character inputs, we're instead creating one embedding layer and sharing it. This is a reasonable approach to handling sequences since each input comes from an identical distribution.

The next thing to observe is the part where h is defined. The first character is fed through the hidden layer like normal, but the other characters in the sequence are doing something different. We're re-using the same layer, but instead of just taking the character as input, we're using the character + the previous output h. This is the "hidden" state of the model. I think about it in the following way: "give me the output of this layer for character c conditioned on the fact that these other characters (represented by h) came before it".

You'll notice that there's no use of an RNN class at all. Basically what's going on here is we're implementing an "unrolled" RNN from scratch on our own.

from keras import backend as K
from keras.models import Model
from keras.layers import add
from keras.layers import Input, Reshape, Dense, Add
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam

def Char3Model(vocab_size, n_factors, n_hidden):
    embed_layer = Embedding(vocab_size, n_factors)
    reshape_layer = Reshape((n_factors,))
    input_layer = Dense(n_hidden, activation='relu')
    hidden_layer = Dense(n_hidden, activation='tanh')
    # vocab_size - 1 matches the 56 columns of y_cat created above
    output_layer = Dense(vocab_size - 1, activation='softmax')

    in1 = Input(shape=(1,))
    in2 = Input(shape=(1,))
    in3 = Input(shape=(1,))

    c1 = input_layer(reshape_layer(embed_layer(in1)))
    c2 = input_layer(reshape_layer(embed_layer(in2)))
    c3 = input_layer(reshape_layer(embed_layer(in3)))

    h = hidden_layer(c1)
    h = hidden_layer(add([h, c2]))
    h = hidden_layer(add([h, c3]))

    out = output_layer(h)

    model = Model(inputs=[in1, in2, in3], outputs=out)
    opt = Adam(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=opt)

    return model
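
To make the recurrence concrete, here's a rough numpy sketch of the same forward pass with the Keras machinery stripped away. The toy sizes, the random weights, and the forward function are assumptions for illustration only, and biases are left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_factors, n_hidden = 10, 4, 6   # toy sizes, not the ones used in the post

# Randomly initialized stand-ins for the weights a trained model would hold.
E = rng.normal(size=(vocab, n_factors))        # embedding table
W_in = rng.normal(size=(n_factors, n_hidden))  # shared "input" dense layer
W_h = rng.normal(size=(n_hidden, n_hidden))    # shared "hidden" dense layer
W_out = rng.normal(size=(n_hidden, vocab))     # output layer

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(char_ids):
    h = np.zeros(n_hidden)
    for t, idx in enumerate(char_ids):
        c = np.maximum(0.0, E[idx] @ W_in)     # embed, then relu dense
        # The first step has no previous state; later steps add it to the input.
        h = np.tanh((c if t == 0 else h + c) @ W_h)
    return softmax(h @ W_out)

probs = forward([3, 1, 7])
print(probs.shape, probs.sum())
```

The same three weight matrices are applied at every step, which is the whole trick: the sequence can get longer without adding any parameters.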

Train the model for a few iterations.

model = Char3Model(vocab_size, n_factors, n_hidden)
history = model.fit(x=[x1, x2, x3], y=y_cat, batch_size=512, epochs=3, verbose=1)

Epoch 1/3
200297/200297 [==============================] - 4s 20us/step - loss: 2.4007
Epoch 2/3
200297/200297 [==============================] - 2s 10us/step - loss: 2.0852
Epoch 3/3
200297/200297 [==============================] - 2s 10us/step - loss: 1.9470

In order to make sense of the model's output, we need a helper function that converts the character probability array that it returns into an actual character. This is where the reverse lookup table we created earlier comes in handy!

def get_next_char(model, s):
    idxs = [np.array([char_indices[c]]) for c in s]
    pred = model.predict(idxs)
    char_idx = np.argmax(pred)
    return chars[char_idx]

get_next_char(model, ' th')

'e'

get_next_char(model, 'and')

' '

It appears to be spitting out sensible results. The 3-character approach is very limiting though. That's not enough context for even a full word most of the time. For our next step, let's expand the input window to 8 characters. We can create an input array using some list comprehension magic to output a list of lists, then stacking them together into an array. Try experimenting with the logic below yourself to get a better sense of what it's doing. The target array is created in a similar manner as before.

cs = 8

c_in = [[idx[i + j] for i in range(cs)] for j in range(len(idx) - cs)]
c_out = [idx[j + cs] for j in range(len(idx) - cs)]

X = np.stack(c_in, axis=0)
y = np.stack(c_out)

Notice this time we're making better use of our data by making the sequences overlapping. For example, the first "row" in the data uses characters 0-7 to predict character 8. The next "row" uses characters 1-8 to predict character 9, and so on. We just increment by one each time. It does create a lot of duplicate data, but that's not a huge issue with a corpus of this size.

X.shape, y.shape

((600885, 8), (600885,))

It helps to look at an example to see how the data is formatted. Each row is a sequence of 8 characters from the text. As you go down the rows it's apparent they're offset by one character.

X[:cs, :cs]

array([[42, 44, 31, 32, 27, 29, 31,  0],
       [44, 31, 32, 27, 29, 31,  0,  0],
       [31, 32, 27, 29, 31,  0,  0,  0],
       [32, 27, 29, 31,  0,  0,  0, 45],
       [27, 29, 31,  0,  0,  0, 45, 47],
       [29, 31,  0,  0,  0, 45, 47, 42],
       [31,  0,  0,  0, 45, 47, 42, 42],
       [ 0,  0,  0, 45, 47, 42, 42, 41]])

y[:cs]

array([ 0,  0, 45, 47, 42, 42, 41, 45])
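
The two ways of slicing the text used so far differ only in how far the window moves between rows. A small pure-Python sketch makes this easy to play with. The make_windows helper and the toy sequence are hypothetical, not part of the code above.

```python
def make_windows(seq, size, stride):
    """Split seq into (inputs, targets) where each target follows its window."""
    inputs, targets = [], []
    for start in range(0, len(seq) - size, stride):
        inputs.append(seq[start:start + size])
        targets.append(seq[start + size])
    return inputs, targets

toy = list(range(10))  # stand-in for the integer-encoded corpus

# stride == size: non-overlapping windows, as in the 3-character model
x_skip, y_skip = make_windows(toy, size=3, stride=3)
# stride == 1: overlapping windows, as in the 8-character model
x_slide, y_slide = make_windows(toy, size=3, stride=1)

print(len(x_skip), len(x_slide))
```

With a stride of one, nearly every position in the text becomes a training example, which is why the row count jumps from roughly a third of the corpus to almost all of it.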

Since we have separate inputs for each character, Keras expects separate arrays rather than one big array. We also need to one-hot encode the target again.

X_array = [X[:, i] for i in range(X.shape[1])]
y_cat = keras.utils.to_categorical(y)

The 8-character model works exactly the same way as the 3-character model, there are just more of the same steps. Rather than write them all out in code, I converted it to a loop. Again, this is almost exactly the way an RNN works under the hood.

def CharLoopModel(vocab_size, n_chars, n_factors, n_hidden):
    embed_layer = Embedding(vocab_size, n_factors)
    reshape_layer = Reshape((n_factors,))
    input_layer = Dense(n_hidden, activation='relu')
    hidden_layer = Dense(n_hidden, activation='tanh')
    output_layer = Dense(vocab_size, activation='softmax')
    
    inputs = []
    for i in range(n_chars):
        inp = Input(shape=(1,))
        inputs.append(inp)
        c = input_layer(reshape_layer(embed_layer(inp)))
        if i == 0:
            h = hidden_layer(c)
        else:
            h = hidden_layer(add([h, c]))

    out = output_layer(h)

    model = Model(inputs=inputs, outputs=out)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)

    return model

Train the model a bit and generate some predictions.

model = CharLoopModel(vocab_size, cs, n_factors, n_hidden)
history = model.fit(x=X_array, y=y_cat, batch_size=512, epochs=5, verbose=1)

Epoch 1/5
600885/600885 [==============================] - 9s 15us/step - loss: 2.1838
Epoch 2/5
600885/600885 [==============================] - 8s 13us/step - loss: 1.7245
Epoch 3/5
600885/600885 [==============================] - 8s 13us/step - loss: 1.5888
Epoch 4/5
600885/600885 [==============================] - 8s 13us/step - loss: 1.5200
Epoch 5/5
600885/600885 [==============================] - 8s 13us/step - loss: 1.4781

get_next_char(model, 'for thos')

'e'

get_next_char(model, 'queens a')

'n'

Now we're ready to replace the loop with a real recurrent layer. The first thing to notice is that we no longer need to create separate inputs for each step in the sequence - recurrent layers in Keras are designed to accept 3-dimensional arrays where the 2nd dimension is the number of timesteps. We just need to add an extra dimension to the input shape with the number of characters.

The second wrinkle is the use of the "TimeDistributed" class on the embedding layer. Just as with the input, this is another more convenient way of doing what we were already doing by defining and re-using layers. Wrapping a layer with "TimeDistributed" basically says "apply this to every timestep in the array". Like the RNN, it expects (and returns) a 3-dimensional array. The reshape operation is the same story, we just add another dimension to it. The RNN layer itself is very straightforward.

from keras.layers import TimeDistributed, SimpleRNN

def CharRnn(vocab_size, n_chars, n_factors, n_hidden):
    i = Input(shape=(n_chars, 1))
    x = TimeDistributed(Embedding(vocab_size, n_factors))(i)
    x = Reshape((n_chars, n_factors))(x)
    x = SimpleRNN(n_hidden, activation='tanh')(x)
    x = Dense(vocab_size, activation='softmax')(x)

    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)

    return model

Let's look at a summary of the model. Notice the array shapes have a third dimension to them until we get on the other side of the RNN.

model = CharRnn(vocab_size, cs, n_factors, n_hidden)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_12 (InputLayer)        (None, 8, 1)              0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 8, 1, 42)          2394      
_________________________________________________________________
reshape_3 (Reshape)          (None, 8, 42)             0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 256)               76544     
_________________________________________________________________
dense_7 (Dense)              (None, 57)                14649     
=================================================================
Total params: 93,587
Trainable params: 93,587
Non-trainable params: 0
_________________________________________________________________
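
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard layouts: one vector per character for the embedding, input and recurrent weight matrices plus a bias for the simple RNN, and weights plus a bias for the dense layer. The vocabulary size of 57 is read off the output shape of the last layer.

```python
vocab_size, n_factors, n_hidden = 57, 42, 256

embedding = vocab_size * n_factors                    # one vector per character
simple_rnn = n_hidden * (n_factors + n_hidden + 1)    # input + recurrent + bias
dense = (n_hidden + 1) * vocab_size                   # weights + bias
total = embedding + simple_rnn + dense

print(embedding, simple_rnn, dense, total)  # 2394 76544 14649 93587
```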

Reshape the input to match the 3-dimensional input format of (rows, timesteps, features). Since we only have one feature, the last dimension is trivially set to one.

X = X.reshape((X.shape[0], cs, 1))
X.shape

(600885, 8, 1)

Train the model for a bit. Notice that the loss looks almost identical to the last model! All we really did is shuffle things around to take advantage of some built-in classes that Keras provides. The model structure and performance should look no different than before.

history = model.fit(x=X, y=y_cat, batch_size=512, epochs=5, verbose=1)

Epoch 1/5
600885/600885 [==============================] - 11s 18us/step - loss: 2.2863
Epoch 2/5
600885/600885 [==============================] - 11s 18us/step - loss: 1.8356
Epoch 3/5
600885/600885 [==============================] - 10s 17us/step - loss: 1.6601
Epoch 4/5
600885/600885 [==============================] - 10s 17us/step - loss: 1.5672
Epoch 5/5
600885/600885 [==============================] - 10s 17us/step - loss: 1.5102

We can train it a bit longer at a lower learning rate to reduce the loss further.

K.set_value(model.optimizer.lr, 0.0001)
history = model.fit(x=X, y=y_cat, batch_size=512, epochs=3, verbose=1)

Epoch 1/3
600885/600885 [==============================] - 10s 17us/step - loss: 1.4350
Epoch 2/3
600885/600885 [==============================] - 10s 17us/step - loss: 1.4233
Epoch 3/3
600885/600885 [==============================] - 10s 17us/step - loss: 1.4175

def get_next_char(model, s):
    idxs = np.array([char_indices[c] for c in s])
    idxs = idxs.reshape((1, idxs.shape[0], 1))
    pred = model.predict(idxs)
    char_idx = np.argmax(pred)
    return chars[char_idx]

get_next_char(model, 'for thos')

'e'

Since the model is getting better, we can now try to generate more than one character of text. All we need is an initial seed of 8 characters and it can go on as long as we like. To do this, we'll create a simple helper function that continuously predicts the next character using the last 8 characters that it spit out (starting with the seed value).

def get_next_n_chars(model, s, n):
    r = s
    for i in range(n):
        c = get_next_char(model, s)
        r += c
        s = s[1:] + c
    return r

get_next_n_chars(model, 'for thos', 40)

'for those who has not to be a conscience of the '
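
If you want to poke at the sliding-window logic without a trained model, the same loop can be run against a stand-in predictor. Everything in this sketch is hypothetical: generate mirrors the helper above, and toy_predict is a made-up rule that just cycles through three letters.

```python
def generate(predict_next, seed, n):
    """Extend seed by n characters, feeding back a fixed-width window."""
    result = seed
    window = seed
    for _ in range(n):
        nxt = predict_next(window)
        result += nxt
        window = window[1:] + nxt   # drop the oldest character, append the newest
    return result

CYCLE = "abc"

def toy_predict(window):
    # Hypothetical stand-in for a trained model: return the letter that
    # follows the last character of the window in a repeating cycle.
    last = CYCLE.index(window[-1])
    return CYCLE[(last + 1) % len(CYCLE)]

sample = generate(toy_predict, "abc", 5)
print(sample)
```

The window never grows past the seed length, so the predictor only ever sees as much context as it was trained on.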

It's definitely getting better. There are more improvements we can make though! In the current model, each instance of the data is completely independent. When a new sequence comes in, the model has no idea what came before that sequence. That "hidden state" mentioned earlier (which is now part of the RNN layer) gets thrown away. However, there's a way we can set this up that persists that hidden state through to the next part of the sequence. In other words, it conditions the output not only on the current 8 characters but all the characters that came before it as well.

The good news is that this capability is built into Keras's recurrent layers; we just need to set a flag to true! The bad news is that we need to re-think how the data is structured. Stateful models require 1) a fixed batch size, which is specified in the model input, and 2) that each batch be a "slice" of sequences such that the next batch contains the next part of each sequence. In other words, we need to split up our data (which is one long continuous stream of text) into n equal-length streams of text, where n is the batch size. Then, we need to carve up these n streams into sequences of length 8 (which is the sequence length the model looks at), with the character following each position in a sequence being the target (the thing we're predicting).

If that sounds confusing and complicated, that's because it is. It took me a while to make sense of it (and figure out how to express it in code) but hopefully you can follow along. Below is the first step, which splits the data up into chunks and stacks them vertically into an array. The result is 64 equal-length continuous sequences of text.

bs = 64
seg_len = len(text) // bs
segments = [idx[i*seg_len:(i+1)*seg_len] for i in range(bs)]
segments = np.stack(segments)

segments.shape

(64, 9388)

One other change happening at the same time is we're no longer staggering the input by one character (which duplicates a lot of text because most of it is repeated in each row). Instead, we're now carving the data into chunks of non-overlapping characters like we did originally. However, we're going to make better use of it this time. Instead of just predicting character 8 based on characters 0-7, we're going to predict characters 1-8 conditioned on the characters in the sequence that came before them. Each pass will actually be 8 character predictions, and the loss function will be calculated across all of those outputs (we'll see how to do this in a minute).

Below we're creating a list of lists, where each sub-list is an 8-character sequence. The second list is offset by one (this is our target).

c_in = [segments[:,i*cs:(i+1)*cs] for i in range(seg_len // cs)]
c_out = [segments[:,(i*cs)+1:((i+1)*cs)+1] for i in range(seg_len // cs)]

Now we just need to concatenate and reshape these into arrays that we can use with the model. We end up with ~75,000 chunks of unique 8-character sequences.

X = np.concatenate(c_in)
X = X.reshape((X.shape[0], X.shape[1], 1))
y = np.concatenate(c_out)
y_cat = keras.utils.to_categorical(y)

X.shape, y_cat.shape

((75072, 8, 1), (75072, 8, 57))

Crucially, they are ordered such that the 65th row is a continuation of the 1st row, the 66th row is a continuation of the 2nd row, and so on all the way down.

''.join(indices_char[i] for i in np.concatenate((X[0,:,0], X[64,:,0], X[128,:,0])))

'preface\n\n\nsupposing that'
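
Here's the same batching scheme on a toy stream that's small enough to inspect by eye. The sizes are made up and the stream is just the integers 0-39, so every target should be its input plus one. One deviation from the code above, flagged as an assumption on my part: the number of chunks is computed from seg_len - 1 so the offset target slice never runs past the end of a segment when seg_len happens to be a multiple of the sequence length.

```python
import numpy as np

stream = np.arange(40)   # toy stand-in for the integer-encoded corpus
bs, cs = 4, 3            # toy batch size and sequence length

# Step 1: split the stream into bs equal-length, contiguous segments.
seg_len = len(stream) // bs
segments = stream[:bs * seg_len].reshape(bs, seg_len)

# Step 2: carve each segment into cs-length inputs and one-step-ahead targets.
n_chunks = (seg_len - 1) // cs
X = np.concatenate([segments[:, i * cs:(i + 1) * cs] for i in range(n_chunks)])
y = np.concatenate([segments[:, i * cs + 1:(i + 1) * cs + 1] for i in range(n_chunks)])

print(X.shape)
print(X[0], X[bs], X[2 * bs])   # row bs continues row 0, row 2*bs continues row bs
print(y[0])                     # targets are the inputs shifted by one
```

Because row i of each batch picks up where row i of the previous batch left off, the hidden state carried over in slot i always belongs to the same stream of text.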

Next we can create the stateful RNN model. It's similar to the last one but there are a few wrinkles. The input specifies "batch_shape" and has three dimensions (this is a hard requirement to use stateful RNNs in Keras, and gets quite annoying during inference time). We've set "return_sequences" to true, which changes the shape that the RNN returns and gives us an output for each step in the sequence. We've set "stateful" to true, the motivation for which was already discussed. Finally, we've wrapped the last dense layer with "TimeDistributed". This is because the RNN is now returning a higher-dimensional array to account for the output at each timestep. Everything else works basically the same way.

def CharStatefulRnn(vocab_size, n_chars, n_factors, n_hidden, bs):
    i = Input(batch_shape=(bs, n_chars, 1))
    x = TimeDistributed(Embedding(vocab_size, n_factors))(i)
    x = Reshape((n_chars, n_factors))(x)
    x = SimpleRNN(n_hidden, activation='tanh', return_sequences=True, stateful=True)(x)
    x = TimeDistributed(Dense(vocab_size, activation='softmax'))(x)

    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)

    return model

Looking at the output shapes, we can see the effect of turning on "return_sequences". Note that the number of model parameters has not changed. The complexity is identical, we've just changed the task and the information available to solve it.

model = CharStatefulRnn(vocab_size, cs, n_factors, n_hidden, bs)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_13 (InputLayer)        (64, 8, 1)                0         
_________________________________________________________________
time_distributed_2 (TimeDist (64, 8, 1, 42)            2394      
_________________________________________________________________
reshape_4 (Reshape)          (64, 8, 42)               0         
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (64, 8, 256)              76544     
_________________________________________________________________
time_distributed_3 (TimeDist (64, 8, 57)               14649     
=================================================================
Total params: 93,587
Trainable params: 93,587
Non-trainable params: 0
_________________________________________________________________

One quirk of using stateful RNNs is that we now have to manually reset the model state; it never goes away until we tell it to. I just created a simple callback that resets the state at the end of every epoch.

from keras.callbacks import Callback

class ResetModelState(Callback):    
    def on_epoch_end(self, epoch, logs=None):
        self.model.reset_states()

reset_state = ResetModelState()

Train the model for a while as before, with the addition of the callback to reset state between epochs.

model.fit(x=X, y=y_cat, batch_size=bs, epochs=8, verbose=1, callbacks=[reset_state], shuffle=False)

Epoch 1/8
75072/75072 [==============================] - 21s 277us/step - loss: 2.2509
Epoch 2/8
75072/75072 [==============================] - 20s 261us/step - loss: 1.8441
Epoch 3/8
75072/75072 [==============================] - 20s 263us/step - loss: 1.6865
Epoch 4/8
75072/75072 [==============================] - 19s 259us/step - loss: 1.6052
Epoch 5/8
75072/75072 [==============================] - 20s 261us/step - loss: 1.5540
Epoch 6/8
75072/75072 [==============================] - 20s 262us/step - loss: 1.5186
Epoch 7/8
75072/75072 [==============================] - 20s 261us/step - loss: 1.4922
Epoch 8/8
75072/75072 [==============================] - 20s 263us/step - loss: 1.4714

K.set_value(model.optimizer.lr, 0.0001)
model.fit(x=X, y=y_cat, batch_size=bs, epochs=3, verbose=1, callbacks=[reset_state], shuffle=False)

Epoch 1/3
75072/75072 [==============================] - 19s 259us/step - loss: 1.4280
Epoch 2/3
75072/75072 [==============================] - 19s 259us/step - loss: 1.4191
Epoch 3/3
75072/75072 [==============================] - 20s 264us/step - loss: 1.4154

The "get next" functions need to be updated since our approach has changed. One of the annoying things about stateful models is the batch size is fixed, so even when making a prediction it needs an array of the same size, no matter if we just want to predict one sequence. I got around this with some numpy hackery.

def get_next_char(model, bs, s):
    idxs = np.array([char_indices[c] for c in s])
    idxs = idxs.reshape((1, idxs.shape[0], 1))
    idxs = np.repeat(idxs, bs, axis=0)
    pred = model.predict(idxs, batch_size=bs)
    char_idx = np.argmax(pred[0, -1])  # last timestep of the first row
    return chars[char_idx]

def get_next_n_chars(model, bs, s, n):
    r = s
    for i in range(n):
        c = get_next_char(model, bs, s)
        r += c
        s = s[1:] + c
    return r

get_next_n_chars(model, bs, 'for thos', 40)

'for those in the same the same the same the same'

The output is actually a bit worse than before, but we're still using simple RNNs which aren't that great to begin with. The real fun comes when we make the jump to a more complex unit like the LSTM. The details of LSTMs are beyond my scope here but there's a great blog post that everyone links to as the canonical explainer for LSTMs, which you can find here. This is the easiest step yet as the only thing we need to do is replace the class name. The only other change I made is increasing the number of hidden units. Everything else stays exactly the same.

from keras.layers import LSTM

n_hidden = 512

def CharStatefulLSTM(vocab_size, n_chars, n_factors, n_hidden, bs):
    i = Input(batch_shape=(bs, n_chars, 1))
    x = TimeDistributed(Embedding(vocab_size, n_factors))(i)
    x = Reshape((n_chars, n_factors))(x)
    x = LSTM(n_hidden, return_sequences=True, stateful=True)(x)
    x = TimeDistributed(Dense(vocab_size, activation='softmax'))(x)

    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)

    return model

LSTMs need to train for a bit longer. We'll do 20 epochs at each learning rate.

model = CharStatefulLSTM(vocab_size, cs, n_factors, n_hidden, bs)
model.fit(x=X, y=y_cat, batch_size=bs, epochs=20, verbose=1, callbacks=[reset_state], shuffle=False)

Epoch 1/20
75072/75072 [==============================] - 30s 401us/step - loss: 2.1748
Epoch 2/20
75072/75072 [==============================] - 29s 380us/step - loss: 1.6091
Epoch 3/20
75072/75072 [==============================] - 29s 381us/step - loss: 1.4487
Epoch 4/20
75072/75072 [==============================] - 28s 379us/step - loss: 1.3695
Epoch 5/20
75072/75072 [==============================] - 29s 383us/step - loss: 1.3181
Epoch 6/20
75072/75072 [==============================] - 29s 385us/step - loss: 1.2797
Epoch 7/20
75072/75072 [==============================] - 29s 382us/step - loss: 1.2500
Epoch 8/20
75072/75072 [==============================] - 29s 388us/step - loss: 1.2254
Epoch 9/20
75072/75072 [==============================] - 29s 380us/step - loss: 1.2052
Epoch 10/20
75072/75072 [==============================] - 28s 377us/step - loss: 1.1886
Epoch 11/20
75072/75072 [==============================] - 29s 385us/step - loss: 1.1754
Epoch 12/20
75072/75072 [==============================] - 28s 379us/step - loss: 1.1649
Epoch 13/20
75072/75072 [==============================] - 29s 385us/step - loss: 1.1563
Epoch 14/20
75072/75072 [==============================] - 29s 390us/step - loss: 1.1499
Epoch 15/20
75072/75072 [==============================] - 29s 383us/step - loss: 1.1447
Epoch 16/20
75072/75072 [==============================] - 28s 377us/step - loss: 1.1404
Epoch 17/20
75072/75072 [==============================] - 29s 384us/step - loss: 1.1371
Epoch 18/20
75072/75072 [==============================] - 29s 383us/step - loss: 1.1334
Epoch 19/20
75072/75072 [==============================] - 28s 379us/step - loss: 1.1328
Epoch 20/20
75072/75072 [==============================] - 28s 378us/step - loss: 1.1314

K.set_value(model.optimizer.lr, 0.0001)
model.fit(x=X, y=y_cat, batch_size=bs, epochs=20, verbose=1, callbacks=[reset_state], shuffle=False)

Epoch 1/20
75072/75072 [==============================] - 29s 382us/step - loss: 1.1015
Epoch 2/20
75072/75072 [==============================] - 32s 428us/step - loss: 1.0755
Epoch 3/20
75072/75072 [==============================] - 33s 442us/step - loss: 1.0633
Epoch 4/20
75072/75072 [==============================] - 30s 406us/step - loss: 1.0552
Epoch 5/20
75072/75072 [==============================] - 29s 381us/step - loss: 1.0489
Epoch 6/20
75072/75072 [==============================] - 28s 372us/step - loss: 1.0434
Epoch 7/20
75072/75072 [==============================] - 28s 372us/step - loss: 1.0392
Epoch 8/20
75072/75072 [==============================] - 29s 381us/step - loss: 1.0354
Epoch 9/20
75072/75072 [==============================] - 28s 376us/step - loss: 1.0323
Epoch 10/20
75072/75072 [==============================] - 28s 379us/step - loss: 1.0293
Epoch 11/20
75072/75072 [==============================] - 28s 379us/step - loss: 1.0264
Epoch 12/20
75072/75072 [==============================] - 28s 373us/step - loss: 1.0246
Epoch 13/20
75072/75072 [==============================] - 28s 376us/step - loss: 1.0224
Epoch 14/20
75072/75072 [==============================] - 28s 373us/step - loss: 1.0203
Epoch 15/20
75072/75072 [==============================] - 29s 382us/step - loss: 1.0183
Epoch 16/20
75072/75072 [==============================] - 28s 376us/step - loss: 1.0162
Epoch 17/20
75072/75072 [==============================] - 28s 376us/step - loss: 1.0150
Epoch 18/20
75072/75072 [==============================] - 28s 376us/step - loss: 1.0134
Epoch 19/20
75072/75072 [==============================] - 28s 377us/step - loss: 1.0125
Epoch 20/20
75072/75072 [==============================] - 28s 378us/step - loss: 1.0108

And now the moment of truth!

pprint(get_next_n_chars(model, bs, 'for thos', 400))

('for those whoever be no longer for their shows that is the basic of the '
 'conseque of the conseque once more proves and the same of the consequent, '
 'and at the other that is the basic of the conseque perfeaced itself to the '
 'sense and self-conseque contemptations of the conseque once still that the '
 'great people take a soul as a profoundination and an artistic as something '
 'might be the most problem and self-co')

Ha, well I wouldn't quite call it sensible but it's not super-terrible either. It's forming mostly complete words, occasionally using punctuation, etc. Not bad for being trained one character at a time. There are many ways that this can be improved of course, but hopefully this has illustrated the key concepts behind building a sequence model.

Deep Learning With Keras: Recommender Systems

John Wittenauer — Mon, 29 Apr 2019 18:36:03 GMT

In this post we'll continue the series on deep learning by using the popular Keras framework to build a recommender system. This use case is much less common in deep learning literature than things like image classifiers or text generators, but may arguably be an even more common problem. In fact, as you'll see below, it's debatable whether this topic even qualifies as "deep learning" because we're going to see how to build a pretty good recommender system without using a neural network at all! We will, however, take advantage of the power of a modern computation framework like Keras to implement the recommender with minimal code. We'll try a couple different approaches using a technique called collaborative filtering. Finally we'll build a true neural network and see how it compares to the collaborative filtering approach.

The data used for this task is the MovieLens data set. As with the previous posts, much of this content is originally based on Jeremy Howard's excellent fast.ai lessons.

I've already saved the zip file to a local directory so we can get started with some imports and reading in the ratings.csv file, which is where the data for this task comes from.

%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

PATH = '/home/paperspace/data/ml-latest-small/'

ratings = pd.read_csv(PATH + 'ratings.csv')
ratings.head()

userId	movieId	rating	timestamp
1	31	2.5	1260759144
1	1029	3.0	1260759179
1	1061	3.0	1260759182
1	1129	2.0	1260759185
1	1172	4.0	1260759205

The data is tabular and consists of a user ID, a movie ID, and a rating (there's also a timestamp but we won't use it for this task). Our task is to predict the rating for a user/movie pair, with the idea that if we had a model that's good at this task then we could predict how a user would rate movies they haven't seen yet and recommend movies with the highest predicted rating.

The zip file also includes a listing of movies and their associated genres. We don't actually need this for the model but it's useful to know about.

movies = pd.read_csv(PATH + 'movies.csv')
movies.head()


movieId	title	genres
1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
2	Jumanji (1995)	Adventure|Children|Fantasy
3	Grumpier Old Men (1995)	Comedy|Romance
4	Waiting to Exhale (1995)	Comedy|Drama|Romance
5	Father of the Bride Part II (1995)	Comedy

To get a better sense of what the data looks like, we can turn it into a table by selecting the top 15 users/movies from the data and joining them together. The result shows how each of the top users rated each of the top movies.

g = ratings.groupby('userId')['rating'].count()
top_users = g.sort_values(ascending=False)[:15]

g = ratings.groupby('movieId')['rating'].count()
top_movies = g.sort_values(ascending=False)[:15]

top_r = ratings.join(top_users, rsuffix='_r', how='inner', on='userId')
top_r = top_r.join(top_movies, rsuffix='_r', how='inner', on='movieId')

pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum)

movieId	1	110	260	296	318	356	480	527	589	593	608	1196	1198	1270	2571
userId															
15	2.0	3.0	5.0	5.0	2.0	1.0	3.0	4.0	4.0	5.0	5.0	5.0	4.0	5.0	5.0
30	4.0	5.0	4.0	5.0	5.0	5.0	4.0	5.0	4.0	4.0	5.0	4.0	5.0	5.0	3.0
73	5.0	4.0	4.5	5.0	5.0	5.0	4.0	5.0	3.0	4.5	4.0	5.0	5.0	5.0	4.5
212	3.0	5.0	4.0	4.0	4.5	4.0	3.0	5.0	3.0	4.0	NaN	NaN	3.0	3.0	5.0
213	3.0	2.5	5.0	NaN	NaN	2.0	5.0	NaN	4.0	2.5	2.0	5.0	3.0	3.0	4.0
294	4.0	3.0	4.0	NaN	3.0	4.0	4.0	4.0	3.0	NaN	NaN	4.0	4.5	4.0	4.5
311	3.0	3.0	4.0	3.0	4.5	5.0	4.5	5.0	4.5	2.0	4.0	3.0	4.5	4.5	4.0
380	4.0	5.0	4.0	5.0	4.0	5.0	4.0	NaN	4.0	5.0	4.0	4.0	NaN	3.0	5.0
452	3.5	4.0	4.0	5.0	5.0	4.0	5.0	4.0	4.0	5.0	5.0	4.0	4.0	4.0	2.0
468	4.0	3.0	3.5	3.5	3.5	3.0	2.5	NaN	NaN	3.0	4.0	3.0	3.5	3.0	3.0
509	3.0	5.0	5.0	5.0	4.0	4.0	3.0	5.0	2.0	4.0	4.5	5.0	5.0	3.0	4.5
547	3.5	NaN	NaN	5.0	5.0	2.0	3.0	5.0	NaN	5.0	5.0	2.5	2.0	3.5	3.5
564	4.0	1.0	2.0	5.0	NaN	3.0	5.0	4.0	5.0	5.0	5.0	5.0	5.0	3.0	3.0
580	4.0	4.5	4.0	4.5	4.0	3.5	3.0	4.0	4.5	4.0	4.5	4.0	3.5	3.0	4.5
624	5.0	NaN	5.0	5.0	NaN	3.0	3.0	NaN	3.0	5.0	4.0	5.0	5.0	5.0	2.0

To build our first collaborative filtering model, we need to take care of a few things first. The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential starting at zero to use for modeling (you'll see why later). We can use scikit-learn's LabelEncoder class to transform the fields. We'll also create variables with the total number of unique users and movies in the data, as well as the min and max ratings present in the data, for reasons that will become apparent shortly.

user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()

item_enc = LabelEncoder()
ratings['movie'] = item_enc.fit_transform(ratings['movieId'].values)
n_movies = ratings['movie'].nunique()

ratings['rating'] = ratings['rating'].values.astype(np.float32)
min_rating = min(ratings['rating'])
max_rating = max(ratings['rating'])

n_users, n_movies, min_rating, max_rating

(671, 9066, 0.5, 5.0)
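
To make the "sequential starting at zero" idea concrete, here is a small NumPy sketch of the same kind of mapping. The raw IDs are made up for illustration, and np.unique is used as a stand-in for LabelEncoder rather than a claim about how scikit-learn does it internally.

```python
import numpy as np

# Hypothetical raw IDs with gaps, standing in for the movieId column.
raw_ids = np.array([31, 1029, 31, 1061, 1129, 1029])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the position of each original value within that sorted list.
classes, encoded = np.unique(raw_ids, return_inverse=True)

print(classes)           # sorted distinct IDs
print(encoded)           # contiguous codes starting at zero
print(classes[encoded])  # indexing back recovers the raw IDs
```

The codes run from 0 to len(classes) - 1 with no gaps, so they can be used directly as row indices into a table with one row per distinct ID.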

Create a traditional (X, y) pairing of data and label, then split the data into training and test data sets.

X = ratings[['user', 'movie']].values
y = ratings['rating'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((90003, 2), (10001, 2), (90003,), (10001,))

Another constant we'll need for the model is the number of factors per user/movie. This number can be whatever we want, but for the collaborative filtering model it does need to be the same for both users and movies, because we'll be taking a dot product between the two vectors. When Jeremy covered this in his class, he said he played around with different numbers and 50 seemed to work best, so we'll go with that.

Finally, we need to turn users and movies into separate arrays in the training and test data. This is because in Keras they'll each be defined as distinct inputs, and the way Keras works is each input needs to be fed in as its own array.

n_factors = 50

X_train_array = [X_train[:, 0], X_train[:, 1]]
X_test_array = [X_test[:, 0], X_test[:, 1]]

Now we get to the model itself. The main idea here is we're going to use embeddings to represent each user and each movie in the data. These embeddings will be vectors (of size n_factors) that start out as random numbers but are fit by the model to capture the essential qualities of each user/movie. We accomplish this by computing the dot product between a user vector and a movie vector to get a predicted rating, and training the embeddings to minimize the error between that prediction and the actual rating. The code is fairly simple; there isn't even a traditional neural network layer or activation involved. I stuck some regularization on the embedding layers and used a different initializer, but even that probably isn't necessary. Notice that this is where we need the number of unique users and movies, since those are required to define the size of each embedding matrix.

from keras.models import Model
from keras.layers import Input, Reshape, Dot
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.regularizers import l2

def RecommenderV1(n_users, n_movies, n_factors):
    user = Input(shape=(1,))
    u = Embedding(n_users, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(user)
    u = Reshape((n_factors,))(u)
    
    movie = Input(shape=(1,))
    m = Embedding(n_movies, n_factors, embeddings_initializer='he_normal',
                  embeddings_regularizer=l2(1e-6))(movie)
    m = Reshape((n_factors,))(m)
    
    x = Dot(axes=1)([u, m])

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model
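
The computation this model performs can be sketched in plain NumPy. This is only an illustration: it uses made-up toy sizes and random values in place of trained embeddings, and covers the forward pass only, with no training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small made-up sizes; the post uses 671 users, 9066 movies and 50 factors.
n_users_toy, n_movies_toy, n_factors_toy = 4, 6, 3

# One row per user / movie, initialised randomly (training would adjust these).
user_emb = rng.normal(size=(n_users_toy, n_factors_toy))
movie_emb = rng.normal(size=(n_movies_toy, n_factors_toy))

# A batch of (user, movie) index pairs, laid out like the rows of X_train.
pairs = np.array([[0, 2], [3, 5], [1, 2]])

u = user_emb[pairs[:, 0]]    # embedding lookup, shape (3, 3)
m = movie_emb[pairs[:, 1]]   # embedding lookup, shape (3, 3)

# Row-wise dot product: one predicted rating per (user, movie) pair.
pred = np.sum(u * m, axis=1, keepdims=True)

print(pred.shape)
```

The embedding lookup is just row indexing, which is why the user and movie IDs had to be converted to contiguous integers starting at zero.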

This is kind of a neat example of how flexible and powerful modern computation frameworks like Keras and PyTorch are. Even though these are billed as deep learning libraries, they have the building blocks to quickly create any computation graph you want and get automatic differentiation essentially for free. Below you can see that all of the parameters are in the embedding layers; we don't have any traditional neural net components at all.

model = RecommenderV1(n_users, n_movies, n_factors)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 1, 50)        33550       input_1[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 1, 50)        453300      input_2[0][0]                    
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 50)           0           embedding_1[0][0]                
__________________________________________________________________________________________________
reshape_2 (Reshape)             (None, 50)           0           embedding_2[0][0]                
__________________________________________________________________________________________________
dot_1 (Dot)                     (None, 1)            0           reshape_1[0][0]                  
                                                                 reshape_2[0][0]                  
==================================================================================================
Total params: 486,850
Trainable params: 486,850
Non-trainable params: 0
__________________________________________________________________________________________________
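
The parameter counts in the summary can be checked by hand, since each embedding matrix has one row per entity and one column per factor. A quick arithmetic sketch using the values printed earlier:

```python
n_users, n_movies, n_factors = 671, 9066, 50

user_params = n_users * n_factors     # rows x columns of the user embedding matrix
movie_params = n_movies * n_factors   # rows x columns of the movie embedding matrix

print(user_params, movie_params, user_params + movie_params)
```

These come out to 33550, 453300 and 486850, matching the two embedding layers and the total in the summary.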

Let's go ahead and train this for a few epochs and see what we get.

history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))

Train on 90003 samples, validate on 10001 samples
Epoch 1/5
90003/90003 [==============================] - 6s 66us/step - loss: 9.7935 - val_loss: 3.4641
Epoch 2/5
90003/90003 [==============================] - 4s 49us/step - loss: 2.0427 - val_loss: 1.6521
Epoch 3/5
90003/90003 [==============================] - 4s 49us/step - loss: 1.1574 - val_loss: 1.3535
Epoch 4/5
90003/90003 [==============================] - 4s 48us/step - loss: 0.9027 - val_loss: 1.2607
Epoch 5/5
90003/90003 [==============================] - 4s 48us/step - loss: 0.7786 - val_loss: 1.2209

Not bad for a first try. We can make some improvements though. The first thing we can do is add a "bias" to each embedding. The concept is similar to the bias in a fully-connected layer or the intercept in a linear model. It just provides an extra degree of freedom. We can implement this idea using new embedding layers with a vector length of one. The bias embeddings get added to the result of the dot product.

The second improvement we can make is running the result (the dot product plus the bias terms) through a sigmoid layer and then scaling it using the min and max ratings in the data. This is a neat technique that introduces a non-linearity into the output, keeps every prediction inside the valid rating range, and results in a modest performance bump.

I also refactored the code a bit by pulling out the embedding layer and reshape operation into a separate class.

from keras.layers import Add, Activation, Lambda

class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors
    
    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal',
                      embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        return x

def RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)
    
    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    mb = EmbeddingLayer(n_movies, 1)(movie)

    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model
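
To see what the last three layers do numerically, here is a small NumPy sketch of the bias addition followed by the scaled sigmoid. The dot product and bias values are made up for illustration, and scaled_sigmoid is a hypothetical helper name, not something from the post.

```python
import numpy as np

min_rating, max_rating = 0.5, 5.0   # the range found in the data earlier

def scaled_sigmoid(z, lo, hi):
    """Squash z into (0, 1) with a sigmoid, then stretch it to (lo, hi)."""
    return 1.0 / (1.0 + np.exp(-z)) * (hi - lo) + lo

# Made-up values for four (user, movie) pairs.
dot = np.array([-8.0, 0.0, 1.5, 8.0])
user_bias = np.array([0.2, 0.0, -0.1, 0.3])
movie_bias = np.array([-0.2, 0.0, 0.4, 0.1])

ratings_pred = scaled_sigmoid(dot + user_bias + movie_bias, min_rating, max_rating)
print(ratings_pred)
```

However large or small the raw score gets, the prediction stays between 0.5 and 5.0, and a raw score of zero maps to the midpoint of 2.75.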

The model summary shows the new graph. Notice the additional embedding layers with parameter numbers equal to the unique user and movie counts.

model = RecommenderV2(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_3 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 1, 50)        33550       input_3[0][0]                    
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, 1, 50)        453300      input_4[0][0]                    
__________________________________________________________________________________________________
reshape_3 (Reshape)             (None, 50)           0           embedding_3[0][0]                
__________________________________________________________________________________________________
reshape_5 (Reshape)             (None, 50)           0           embedding_5[0][0]                
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 1, 1)         671         input_3[0][0]                    
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, 1, 1)         9066        input_4[0][0]                    
__________________________________________________________________________________________________
dot_2 (Dot)                     (None, 1)            0           reshape_3[0][0]                  
                                                                 reshape_5[0][0]                  
__________________________________________________________________________________________________
reshape_4 (Reshape)             (None, 1)            0           embedding_4[0][0]                
__________________________________________________________________________________________________
reshape_6 (Reshape)             (None, 1)            0           embedding_6[0][0]                
__________________________________________________________________________________________________
add_1 (Add)                     (None, 1)            0           dot_2[0][0]                      
                                                                 reshape_4[0][0]                  
                                                                 reshape_6[0][0]                  
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 1)            0           add_1[0][0]                      
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 1)            0           activation_1[0][0]               
==================================================================================================
Total params: 496,587
Trainable params: 496,587
Non-trainable params: 0
__________________________________________________________________________________________________

history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))

Train on 90003 samples, validate on 10001 samples
Epoch 1/5
90003/90003 [==============================] - 6s 64us/step - loss: 1.2850 - val_loss: 0.9083
Epoch 2/5
90003/90003 [==============================] - 5s 57us/step - loss: 0.7445 - val_loss: 0.7801
Epoch 3/5
90003/90003 [==============================] - 5s 57us/step - loss: 0.5615 - val_loss: 0.7646
Epoch 4/5
90003/90003 [==============================] - 5s 57us/step - loss: 0.4273 - val_loss: 0.7669
Epoch 5/5
90003/90003 [==============================] - 5s 58us/step - loss: 0.3298 - val_loss: 0.7823

Those two additions to the model resulted in a pretty sizable improvement. Validation error is now down to ~0.76 which is about as good as what Jeremy got (and I believe close to SOTA for this data set).
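
One thing worth noting in the log above is that the validation loss bottoms out at epoch 3 and creeps back up afterwards while the training loss keeps falling, which suggests the model starts to overfit. The sketch below picks out the best epoch from the logged values and converts that loss to a rough RMSE. It treats the reported loss as approximately the mean squared error, which is an assumption since the figure may also include the small L2 penalty.

```python
import math

# Validation losses copied from the RecommenderV2 training log above.
val_losses = [0.9083, 0.7801, 0.7646, 0.7669, 0.7823]

# Epochs are numbered from 1 in the log.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
best_loss = val_losses[best_epoch - 1]

# Square root of the (approximate) MSE gives an error in rating units.
rmse = math.sqrt(best_loss)
print(best_epoch, best_loss, round(rmse, 3))
```

An RMSE of roughly 0.87 means predictions are off by a bit under one star on average, on a scale from 0.5 to 5.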

That pretty much covers the conventional approach to solving this problem, but there's another way we can tackle this. Instead of taking the dot product of the embedding vectors, what if we just concatenated the embeddings together and stuck a fully-connected layer on top of them? It's still not technically "deep" but it would at least be a neural network! To modify the code, we can remove the bias embeddings from V2 and do a concat on the embedding layers instead. Then we can add some dropout, insert a dense layer, and stick some dropout on the dense layer as well. Finally, we'll run it through a single-unit dense layer to keep the sigmoid trick at the end.

from keras.layers import Concatenate, Dense, Dropout

def RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    
    movie = Input(shape=(1,))
    m = EmbeddingLayer(n_movies, n_factors)(movie)
    
    x = Concatenate()([u, m])
    x = Dropout(0.05)(x)
    
    x = Dense(10, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.5)(x)
    
    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)

    model = Model(inputs=[user, movie], outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)

    return model
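
For comparison with the dot-product version, here is a rough NumPy sketch of the forward pass through this network. It uses toy sizes and random weights in place of trained ones, and it leaves out dropout, which is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors_toy, n_hidden_toy = 3, 4     # toy sizes; the post uses 50 and 10
min_rating, max_rating = 0.5, 5.0

# Random stand-ins for learned values.
u = rng.normal(size=(2, n_factors_toy))   # user embeddings for a batch of 2
m = rng.normal(size=(2, n_factors_toy))   # movie embeddings for the same batch
W1 = rng.normal(size=(2 * n_factors_toy, n_hidden_toy))
b1 = np.zeros(n_hidden_toy)
W2 = rng.normal(size=(n_hidden_toy, 1))
b2 = np.zeros(1)

x = np.concatenate([u, m], axis=1)   # shape (2, 6): embeddings side by side
h = np.maximum(x @ W1 + b1, 0.0)     # dense layer + ReLU, shape (2, 4)
z = h @ W2 + b2                      # single-unit dense layer, shape (2, 1)

# Same sigmoid-and-scale trick as before.
pred = 1.0 / (1.0 + np.exp(-z)) * (max_rating - min_rating) + min_rating

print(x.shape, h.shape, pred.shape)
```

Because the two embeddings are concatenated rather than multiplied together, nothing here requires them to have the same length.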

Most of the parameters are still in the embedding layers, but we have some added learning capability from the dense layers.

model = RecommenderNet(n_users, n_movies, n_factors, min_rating, max_rating)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_5 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 1, 50)        33550       input_5[0][0]                    
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, 1, 50)        453300      input_6[0][0]                    
__________________________________________________________________________________________________
reshape_7 (Reshape)             (None, 50)           0           embedding_7[0][0]                
__________________________________________________________________________________________________
reshape_8 (Reshape)             (None, 50)           0           embedding_8[0][0]                
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 100)          0           reshape_7[0][0]                  
                                                                 reshape_8[0][0]                  
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 100)          0           concatenate_1[0][0]              
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 10)           1010        dropout_1[0][0]                  
__________________________________________________________________________________________________
activation_2 (Activation)       (None, 10)           0           dense_1[0][0]                    
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 10)           0           activation_2[0][0]               
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            11          dropout_2[0][0]                  
__________________________________________________________________________________________________
activation_3 (Activation)       (None, 1)            0           dense_2[0][0]                    
__________________________________________________________________________________________________
lambda_2 (Lambda)               (None, 1)            0           activation_3[0][0]               
==================================================================================================
Total params: 487,871
Trainable params: 487,871
Non-trainable params: 0
__________________________________________________________________________________________________

history = model.fit(x=X_train_array, y=y_train, batch_size=64, epochs=5,
                    verbose=1, validation_data=(X_test_array, y_test))

Train on 90003 samples, validate on 10001 samples
Epoch 1/5
90003/90003 [==============================] - 6s 71us/step - loss: 0.9461 - val_loss: 0.8079
Epoch 2/5
90003/90003 [==============================] - 6s 64us/step - loss: 0.8097 - val_loss: 0.7898
Epoch 3/5
90003/90003 [==============================] - 6s 63us/step - loss: 0.7781 - val_loss: 0.7855
Epoch 4/5
90003/90003 [==============================] - 6s 64us/step - loss: 0.7617 - val_loss: 0.7820
Epoch 5/5
90003/90003 [==============================] - 6s 63us/step - loss: 0.7513 - val_loss: 0.7858

Without doing any tuning at all we still managed to get a result that's pretty close to the best performance we saw with the traditional approach. This technique has the added benefit that we can easily incorporate additional features into the model. For instance, we could create some date features from the timestamp or throw in the movie genres as a new embedding layer. We could tune the size of the movie and user embeddings independently since they no longer need to match. Lots of possibilities here.

Deep Learning With Keras: Convolutional Networks

John Wittenauer — Tue, 08 Jan 2019 01:02:17 GMT

In my last post, I kicked off a series on deep learning by showing how to apply several core neural network concepts such as dense layers, embeddings, and regularization to build models using structured and/or time-series data. In this post we'll see how to build models using another core component in modern deep learning: convolutions. Convolutional layers are primarily used in image-based models but have some interesting properties that make them useful for sequential data as well. The biggest wrinkle that convolutional layers introduce is an element of locality. They have a receptive field that consists of some subset of the input data. In essence, each convolution can only "see" part of the image, sequence etc. that it's being trained on.
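
As a rough illustration of that locality, here is a naive NumPy convolution (no padding, stride 1) applied to a made-up 5 x 5 "image". This is only a sketch for intuition, not how Keras implements it. Changing a single corner pixel only affects the outputs whose receptive field covers that pixel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D convolution (cross-correlation, as deep learning libraries compute it)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Each output value depends only on a kh x kw patch of the input.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0   # simple averaging filter

before = conv2d_valid(image, kernel)

# Change one corner pixel and recompute.
image[0, 0] += 100.0
after = conv2d_valid(image, kernel)

# Positions in the output that moved.
changed = np.argwhere(~np.isclose(before, after))
print(changed)
```

Only the top-left output changes, because it is the only one whose 3 x 3 patch includes the corner pixel. A dense layer, by contrast, connects every output to every input.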

I'm not going to cover convolutional layers in depth here; there are tons of great resources out there already to learn about them. If you're new to the concept, I would recommend this blog series as a starting point or just do some googling for explainers. There's a lot of good content that comes up.

We'll start with a simple dense network and gradually improve it until we're getting pretty good results classifying images in the CIFAR 10 data set. We'll then see how we can avoid building a network from scratch by taking a large, pre-trained net and fine-tuning it to a custom domain. As with my first post in this series, much of this content is originally based on Jeremy Howard's fast.ai lessons. I've combined content from a few different lessons and converted code to use Keras instead of PyTorch.

Since Keras comes with a pre-built data loader for CIFAR 10, we can just use that to get started instead of worrying about locating and importing the data.

%matplotlib inline
import matplotlib.pyplot as plt
from keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((50000, 32, 32, 3), (50000, 1), (10000, 32, 32, 3), (10000, 1))

Plot a few of the images to get an idea what they look like and confirm that the data loaded correctly. You'll quickly notice the CIFAR 10 images are very low resolution (32 x 32 images with 3 color channels). This makes training from scratch quite feasible even on modest compute resources.

def plot_image(index):
    image = x_train[index, :, :, :]
    plt.imshow(image)

plot_image(4)

plot_image(6)

We need to do a data conversion to get the class labels in one-hot encoded format. This will allow us to use a softmax activation and categorical cross-entropy loss in our network. CIFAR 10 only has 10 distinct classes so this is fairly straightforward.

import keras

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

y_train[0]

array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.], dtype=float32)
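
If it helps to see what the conversion does under the hood, here is a minimal NumPy sketch of one-hot encoding. The three labels are made up, and this is an illustration of the idea rather than the Keras implementation.

```python
import numpy as np

# Hypothetical integer labels shaped like CIFAR 10's (n, 1) label array.
labels = np.array([[6], [9], [0]])
n_classes = 10

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(n_classes, dtype='float32')[labels.ravel()]

print(one_hot.shape)
print(one_hot[0])
```

The first row has a single 1 in position 6, the same pattern as the y_train[0] output above.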

The only other pre-processing step to apply is normalizing the input data. Since everything is an RGB value, we can keep it simple and just divide by 255.

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

Define a few useful configuration items to use throughout the exercise. The input shape variable will have a value of (32, 32, 3) corresponding to the shape of the array for each image.

in_shape = x_train.shape[1:]
batch_size = 256
n_classes = 10
lr = 0.01

Now we can get started with the actual modeling part. For a first attempt, let's do the simplest and most naive model possible. We'll just create a straightforward fully-connected model and stick a softmax activation on at the end.

from keras.models import Model
from keras.layers import Activation, Dense, Flatten, Input
from keras.optimizers import Adam

def SimpleNet(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    x = Flatten()(i)
    
    for n in range(len(layers)):
        x = Dense(layers[n])(x)
        x = Activation('relu')(x)
    
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)
    
    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    
    return model

Note that the architecture is somewhat flexible in that we can define as many dense layers as we want by just passing in a list of numbers to the "layers" parameter (where the numbers correspond to the size of the layer). In this case we're only going to use one layer, but this capability will be very useful later on.

model = SimpleNet(in_shape, [40], n_classes, lr)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 32, 32, 3)         0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 40)                122920    
_________________________________________________________________
activation_1 (Activation)    (None, 40)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                410       
_________________________________________________________________
activation_2 (Activation)    (None, 10)                0         
=================================================================
Total params: 123,330
Trainable params: 123,330
Non-trainable params: 0
_________________________________________________________________

Our last step before training is to define an image data generator. We could just train on the images as-is, but randomly applying transformations to the images will make the classifier more robust. Keras has a utility class built in for just this purpose, so we can use that to randomly shift or flip the direction of the images during training.

from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)
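For intuition about what the generator is doing to each image, here is a rough NumPy approximation of a random horizontal flip plus a random shift of up to 10% in each direction. The function name, the whole-pixel shifts and the edge-padding strategy are assumptions made for this sketch; Keras uses its own transform and fill logic, so treat this as an illustration of the idea rather than a reimplementation.

```python
import numpy as np

def random_flip_and_shift(image, max_shift_frac=0.1, rng=None):
    """Randomly mirror an (H, W, C) image and shift it by a few pixels."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = image.shape
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                    # mirror left/right
    max_dy = int(round(h * max_shift_frac))
    max_dx = int(round(w * max_shift_frac))
    dy = int(rng.integers(-max_dy, max_dy + 1))  # vertical shift in pixels
    dx = int(rng.integers(-max_dx, max_dx + 1))  # horizontal shift in pixels
    # Pad by repeating edge pixels, then crop a window offset by the shift
    padded = np.pad(out, ((max_dy, max_dy), (max_dx, max_dx), (0, 0)), mode='edge')
    top, left = max_dy - dy, max_dx - dx
    return padded[top:top + h, left:left + w, :]

# Stand-in for one CIFAR 10 image
fake_image = np.arange(32 * 32 * 3, dtype='float32').reshape(32, 32, 3)
augmented = random_flip_and_shift(fake_image, rng=np.random.default_rng(0))
print(augmented.shape)
```

The output keeps the original 32 x 32 x 3 shape, so the augmented images can be fed to the same network without any changes.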

Let's try training for 10 epochs and see what happens!

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=10, validation_data=(x_test, y_test), workers=4)

Epoch 1/10
196/196 [==============================] - 26s 134ms/step - loss: 2.4531 - acc: 0.0974 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 2/10
196/196 [==============================] - 23s 120ms/step - loss: 2.2576 - acc: 0.1268 - val_loss: 2.1575 - val_acc: 0.1623
Epoch 3/10
196/196 [==============================] - 24s 120ms/step - loss: 2.1256 - acc: 0.1741 - val_loss: 2.0836 - val_acc: 0.1764
Epoch 4/10
196/196 [==============================] - 24s 122ms/step - loss: 2.1123 - acc: 0.1760 - val_loss: 2.0775 - val_acc: 0.1972
Epoch 5/10
196/196 [==============================] - 23s 119ms/step - loss: 2.0938 - acc: 0.1802 - val_loss: 2.0716 - val_acc: 0.1710
Epoch 6/10
196/196 [==============================] - 24s 120ms/step - loss: 2.0940 - acc: 0.1784 - val_loss: 2.0660 - val_acc: 0.1875
Epoch 7/10
196/196 [==============================] - 24s 121ms/step - loss: 2.0894 - acc: 0.1822 - val_loss: 2.1032 - val_acc: 0.1765
Epoch 8/10
196/196 [==============================] - 24s 121ms/step - loss: 2.0954 - acc: 0.1799 - val_loss: 2.0751 - val_acc: 0.1745
Epoch 9/10
196/196 [==============================] - 23s 120ms/step - loss: 2.0853 - acc: 0.1788 - val_loss: 2.0702 - val_acc: 0.1743
Epoch 10/10
196/196 [==============================] - 23s 120ms/step - loss: 2.0889 - acc: 0.1775 - val_loss: 2.0659 - val_acc: 0.1844
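A useful reference point when reading a log like this is the loss a model would get by assigning equal probability to every class. The short sketch below computes that baseline for 10 classes; the helper name and the example target are my own, chosen only for illustration.

```python
import math

def categorical_cross_entropy(one_hot_target, probs):
    """Cross-entropy for a single example: -sum(t * log(p))."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)

n_classes = 10
uniform_probs = [1.0 / n_classes] * n_classes
target = [0] * n_classes
target[6] = 1                      # any class gives the same result here

baseline_loss = categorical_cross_entropy(target, uniform_probs)
print(round(baseline_loss, 4))     # 2.3026, i.e. ln(10)
```

The first epoch's val_loss of 2.3026 and val_acc of 0.1000 match this baseline, which suggests the network started out predicting essentially the same thing for every image.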

Clearly the naive approach is not very effective. With validation accuracy stuck around 18%, the model is doing only a bit better than random guessing (10% for ten classes). Let's replace the dense layer with a few convolutional layers instead. For our first attempt at using convolutions, we'll use a kernel size of 3 and a stride of 2 (rather than use pooling layers in between the conv layers) and a global max pooling layer to condense the output shape before going through the softmax.

from keras.layers import Conv2D, GlobalMaxPooling2D

def ConvNet(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    
    for n in range(len(layers)):
        if n == 0:
            x = Conv2D(layers[n], kernel_size=3, strides=2)(i)
        else:
            x = Conv2D(layers[n], kernel_size=3, strides=2)(x)
        x = Activation('relu')(x)
    
    x = GlobalMaxPooling2D()(x)
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)
    
    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    
    return model

This time let's try using 3 conv layers with an increasing number of filters in each layer.

model = ConvNet(in_shape, [20, 40, 80], n_classes, lr)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 32, 32, 3)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 15, 15, 20)        560       
_________________________________________________________________
activation_3 (Activation)    (None, 15, 15, 20)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 7, 7, 40)          7240      
_________________________________________________________________
activation_4 (Activation)    (None, 7, 7, 40)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 80)          28880     
_________________________________________________________________
activation_5 (Activation)    (None, 3, 3, 80)          0         
_________________________________________________________________
global_max_pooling2d_1 (Glob (None, 80)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                810       
_________________________________________________________________
activation_6 (Activation)    (None, 10)                0         
=================================================================
Total params: 37,490
Trainable params: 37,490
Non-trainable params: 0
_________________________________________________________________

It's worth checking your intuition and understanding of what's going on by looking at the summary output and verifying that the numbers make sense. For instance, why does the first convolutional layer have 560 parameters? Where does that come from? Well, we have a kernel size of 3 which creates a 3 x 3 filter (i.e. 9 parameters), but the input also has 3 color channels, so each filter really has 3 x 3 x 3 = 27 weights, plus 1 for the bias, for 28 per filter. We specified 20 filters in the first layer, so 28 x 20 = 560. Try applying similar logic to the second conv layer and see if the result makes sense.
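If you'd rather check the arithmetic in code, the same reasoning fits in a small helper. The function name is mine and it assumes square kernels, but the numbers it produces can be compared directly against the summary above.

```python
def conv2d_params(kernel_size, in_channels, filters, use_bias=True):
    """Weights per filter = k * k * in_channels, plus one bias per filter."""
    weights = kernel_size * kernel_size * in_channels * filters
    return weights + (filters if use_bias else 0)

conv1 = conv2d_params(3, in_channels=3, filters=20)    # 560
conv2 = conv2d_params(3, in_channels=20, filters=40)   # 7240
conv3 = conv2d_params(3, in_channels=40, filters=80)   # 28880
dense = 80 * 10 + 10                                   # 810 (weights + biases)

print(conv1, conv2, conv3, dense, conv1 + conv2 + conv3 + dense)
```

The four numbers add up to 37,490, which is the total reported by `model.summary()`. Note that the input depth of each conv layer is the number of filters in the layer before it.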

Now that we've got a model, let's try training it using the exact same approach as before.

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=10, validation_data=(x_test, y_test), workers=4)

Epoch 1/10
196/196 [==============================] - 25s 127ms/step - loss: 1.8725 - acc: 0.3019 - val_loss: 1.7737 - val_acc: 0.3772
Epoch 2/10
196/196 [==============================] - 24s 120ms/step - loss: 1.6342 - acc: 0.4015 - val_loss: 1.5930 - val_acc: 0.4314
Epoch 3/10
196/196 [==============================] - 24s 120ms/step - loss: 1.5503 - acc: 0.4349 - val_loss: 1.5013 - val_acc: 0.4567
Epoch 4/10
196/196 [==============================] - 24s 122ms/step - loss: 1.4848 - acc: 0.4623 - val_loss: 1.4356 - val_acc: 0.4801
Epoch 5/10
196/196 [==============================] - 24s 122ms/step - loss: 1.4493 - acc: 0.4798 - val_loss: 1.3845 - val_acc: 0.4972
Epoch 6/10
196/196 [==============================] - 23s 119ms/step - loss: 1.4186 - acc: 0.4892 - val_loss: 1.3761 - val_acc: 0.5066
Epoch 7/10
196/196 [==============================] - 24s 121ms/step - loss: 1.3999 - acc: 0.4956 - val_loss: 1.3681 - val_acc: 0.5024
Epoch 8/10
196/196 [==============================] - 24s 121ms/step - loss: 1.3837 - acc: 0.5047 - val_loss: 1.4632 - val_acc: 0.4810
Epoch 9/10
196/196 [==============================] - 23s 120ms/step - loss: 1.3838 - acc: 0.5006 - val_loss: 1.3647 - val_acc: 0.5139
Epoch 10/10
196/196 [==============================] - 24s 120ms/step - loss: 1.3565 - acc: 0.5114 - val_loss: 1.3553 - val_acc: 0.5162

The results are a lot different this time! The model is clearly learning and after 10 epochs we're at about 50% accuracy on the validation set. Still, we should be able to do a lot better. For the next attempt let's introduce a few new wrinkles. First, we're going to add batch normalization after each conv layer. Second, we're going to add a single conv layer at the beginning with a larger kernel size and a stride of 1, so the first layer keeps the full 32 x 32 resolution. Third, we're going to introduce "same" padding, which changes the output shape of each conv layer (a stride-2 layer now halves the size exactly instead of also trimming the border). Finally, we're going to add a few more layers to make the model bigger.
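To see how padding and stride affect the output shape, here is a small sketch using the standard output-size formulas for "valid" (no padding) and "same" padding. The helper name is made up, and it only tracks one spatial dimension since the images are square.

```python
import math

def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a conv layer along one dimension."""
    if padding == 'valid':
        return (size - kernel_size) // stride + 1
    if padding == 'same':
        return math.ceil(size / stride)
    raise ValueError("padding must be 'valid' or 'same'")

# Three 3x3, stride-2 layers without padding, as in the previous model
valid_sizes = [32]
for _ in range(3):
    valid_sizes.append(conv_output_size(valid_sizes[-1], 3, 2, 'valid'))

# One 5x5, stride-1 layer followed by four 3x3, stride-2 layers with padding
same_sizes = [32, conv_output_size(32, 5, 1, 'same')]
for _ in range(4):
    same_sizes.append(conv_output_size(same_sizes[-1], 3, 2, 'same'))

print(valid_sizes)   # [32, 15, 7, 3]
print(same_sizes)    # [32, 32, 16, 8, 4, 2]
```

The first list matches the 15 x 15, 7 x 7 and 3 x 3 shapes in the previous summary. With "same" padding each stride-2 layer cleanly halves the size, which is what makes it practical to stack more layers on a 32 x 32 input.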

To make the model definition more modular, I've pulled out the conv layer into a separate class. There are multiple ways to do this (a function would have worked just as well) but I opted to mimic the way Keras's functional API works.

from keras.layers import BatchNormalization

class ConvLayer:
    def __init__(self, filters, kernel_size, stride):
        self.filters = filters
        self.kernel_size = kernel_size
        self.stride = stride

    def __call__(self, x):
        x = Conv2D(self.filters, kernel_size=self.kernel_size,
                   strides=self.stride, padding='same', use_bias=False)(x)
        x = Activation('relu')(x)
        x = BatchNormalization()(x)
        return x

def ConvNet2(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    
    x = Conv2D(layers[0], kernel_size=5, strides=1, padding='same')(i)
    x = Activation('relu')(x)
    
    for n in range(1, len(layers)):
        x = ConvLayer(layers[n], kernel_size=3, stride=2)(x)

    x = GlobalMaxPooling2D()(x)
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)
    
    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    
    return model

model = ConvNet2(in_shape, [10, 20, 40, 80, 160], n_classes, lr)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_3 (InputLayer)         (None, 32, 32, 3)         0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 32, 32, 10)        760       
_________________________________________________________________
activation_7 (Activation)    (None, 32, 32, 10)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 16, 16, 20)        1800      
_________________________________________________________________
activation_8 (Activation)    (None, 16, 16, 20)        0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 16, 16, 20)        80        
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 8, 8, 40)          7200      
_________________________________________________________________
activation_9 (Activation)    (None, 8, 8, 40)          0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 8, 8, 40)          160       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 4, 4, 80)          28800     
_________________________________________________________________
activation_10 (Activation)   (None, 4, 4, 80)          0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 4, 4, 80)          320       
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 2, 2, 160)         115200    
_________________________________________________________________
activation_11 (Activation)   (None, 2, 2, 160)         0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 2, 2, 160)         640       
_________________________________________________________________
global_max_pooling2d_2 (Glob (None, 160)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 10)                1610      
_________________________________________________________________
activation_12 (Activation)   (None, 10)                0         
=================================================================
Total params: 156,570
Trainable params: 155,970
Non-trainable params: 600
_________________________________________________________________

We made a bunch of improvements and the network has a much larger capacity, so let's see what it does.

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=10, validation_data=(x_test, y_test), workers=4)

Epoch 1/10
196/196 [==============================] - 24s 125ms/step - loss: 1.6451 - acc: 0.4258 - val_loss: 1.5408 - val_acc: 0.4597
Epoch 2/10
196/196 [==============================] - 23s 118ms/step - loss: 1.3130 - acc: 0.5280 - val_loss: 1.7158 - val_acc: 0.4559
Epoch 3/10
196/196 [==============================] - 24s 121ms/step - loss: 1.1669 - acc: 0.5803 - val_loss: 1.5101 - val_acc: 0.5311
Epoch 4/10
196/196 [==============================] - 23s 119ms/step - loss: 1.0642 - acc: 0.6205 - val_loss: 1.3304 - val_acc: 0.5538
Epoch 5/10
196/196 [==============================] - 23s 118ms/step - loss: 0.9887 - acc: 0.6485 - val_loss: 1.2749 - val_acc: 0.5955
Epoch 6/10
196/196 [==============================] - 23s 119ms/step - loss: 0.9264 - acc: 0.6717 - val_loss: 1.3210 - val_acc: 0.5819
Epoch 7/10
196/196 [==============================] - 23s 120ms/step - loss: 0.8812 - acc: 0.6887 - val_loss: 0.9221 - val_acc: 0.6807
Epoch 8/10
196/196 [==============================] - 23s 120ms/step - loss: 0.8437 - acc: 0.6985 - val_loss: 0.8809 - val_acc: 0.7012
Epoch 9/10
196/196 [==============================] - 24s 120ms/step - loss: 0.8196 - acc: 0.7083 - val_loss: 0.9064 - val_acc: 0.6873
Epoch 10/10
196/196 [==============================] - 24s 120ms/step - loss: 0.7897 - acc: 0.7194 - val_loss: 0.8259 - val_acc: 0.7179

That's a significant improvement! Our validation accuracy after 10 epochs jumped all the way from ~50% to ~70%. We're already doing pretty well, but there's one more major addition we can make that should bump performance even higher. A key addition to modern convolutional networks was the invention of residual layers, which introduce an "identity" connection to the output of a block of convolutions. Below I've added a new "ResLayer" class that inherits from "ConvLayer" but outputs the sum of the original input and the conv layer's output. Building on the previous network, we've now added two residual layers to each "block" in the model definition. These residual layers have a stride of 1 so they don't change the shape of the output. Finally, we've added a bit of regularization to keep the model from overfitting too badly.
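As a framework-free illustration of the identity connection (separate from the Keras code that follows), here is a tiny NumPy sketch. The `residual` helper and `toy_block` are hypothetical stand-ins; `toy_block` is just a placeholder for the conv, ReLU and batch norm sequence.

```python
import numpy as np

def residual(block, x):
    """Return x + block(x); the block must preserve the shape of x."""
    out = block(x)
    if out.shape != x.shape:
        raise ValueError("a residual block must not change the input shape")
    return x + out

def toy_block(x):
    # Placeholder for conv -> relu -> batch norm
    return np.maximum(0.0, 0.5 * x - 1.0)

x = np.array([[-2.0, 0.0, 4.0]])
y = residual(toy_block, x)
identity = residual(lambda t: np.zeros_like(t), x)

print(y)         # [[-2.  0.  5.]]
print(identity)  # same values as x
```

Two things stand out. If the block outputs zeros, the layer simply passes its input through, so adding a residual layer gives the network an easy way to leave the signal untouched. And the addition only works when the shapes match, which is why the residual layers use a stride of 1.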

from keras import layers
from keras import regularizers
from keras.layers import Dropout

class ConvLayer:
    def __init__(self, filters, kernel_size, stride):
        self.filters = filters
        self.kernel_size = kernel_size
        self.stride = stride

    def __call__(self, x):
        x = Conv2D(self.filters, kernel_size=self.kernel_size,
                   strides=self.stride, padding='same', use_bias=False,
                   kernel_regularizer=regularizers.l2(1e-6))(x)
        x = Activation('relu')(x)
        x = BatchNormalization()(x)
        return x

class ResLayer(ConvLayer):
    def __call__(self, x):
        return layers.add([x, super().__call__(x)])

def ResNet(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    
    x = Conv2D(layers[0], kernel_size=7, strides=1, padding='same')(i)
    x = Activation('relu')(x)

    for n in range(1, len(layers)):
        x = ConvLayer(layers[n], kernel_size=3, stride=2)(x)
        x = ResLayer(layers[n], kernel_size=3, stride=1)(x)
        x = ResLayer(layers[n], kernel_size=3, stride=1)(x)

    x = GlobalMaxPooling2D()(x)
    x = Dropout(0.1)(x)
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)
    
    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    
    return model

model = ResNet(in_shape, [10, 20, 40, 80, 160], n_classes, lr)
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_4 (InputLayer)            (None, 32, 32, 3)    0                                            
__________________________________________________________________________________________________
conv2d_9 (Conv2D)               (None, 32, 32, 10)   1480        input_4[0][0]                    
__________________________________________________________________________________________________
activation_13 (Activation)      (None, 32, 32, 10)   0           conv2d_9[0][0]                   
__________________________________________________________________________________________________
conv2d_10 (Conv2D)              (None, 16, 16, 20)   1800        activation_13[0][0]              
__________________________________________________________________________________________________
activation_14 (Activation)      (None, 16, 16, 20)   0           conv2d_10[0][0]                  
__________________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 16, 16, 20)   80          activation_14[0][0]              
__________________________________________________________________________________________________
conv2d_11 (Conv2D)              (None, 16, 16, 20)   3600        batch_normalization_5[0][0]      
__________________________________________________________________________________________________
activation_15 (Activation)      (None, 16, 16, 20)   0           conv2d_11[0][0]                  
__________________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 16, 16, 20)   80          activation_15[0][0]              
__________________________________________________________________________________________________
add_1 (Add)                     (None, 16, 16, 20)   0           batch_normalization_5[0][0]      
                                                                 batch_normalization_6[0][0]      
__________________________________________________________________________________________________
conv2d_12 (Conv2D)              (None, 16, 16, 20)   3600        add_1[0][0]                      
__________________________________________________________________________________________________
activation_16 (Activation)      (None, 16, 16, 20)   0           conv2d_12[0][0]                  
__________________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 16, 16, 20)   80          activation_16[0][0]              
__________________________________________________________________________________________________
add_2 (Add)                     (None, 16, 16, 20)   0           add_1[0][0]                      
                                                                 batch_normalization_7[0][0]      
__________________________________________________________________________________________________
conv2d_13 (Conv2D)              (None, 8, 8, 40)     7200        add_2[0][0]                      
__________________________________________________________________________________________________
activation_17 (Activation)      (None, 8, 8, 40)     0           conv2d_13[0][0]                  
__________________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 8, 8, 40)     160         activation_17[0][0]              
__________________________________________________________________________________________________
conv2d_14 (Conv2D)              (None, 8, 8, 40)     14400       batch_normalization_8[0][0]      
__________________________________________________________________________________________________
activation_18 (Activation)      (None, 8, 8, 40)     0           conv2d_14[0][0]                  
__________________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 8, 8, 40)     160         activation_18[0][0]              
__________________________________________________________________________________________________
add_3 (Add)                     (None, 8, 8, 40)     0           batch_normalization_8[0][0]      
                                                                 batch_normalization_9[0][0]      
__________________________________________________________________________________________________
conv2d_15 (Conv2D)              (None, 8, 8, 40)     14400       add_3[0][0]                      
__________________________________________________________________________________________________
activation_19 (Activation)      (None, 8, 8, 40)     0           conv2d_15[0][0]                  
__________________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 8, 8, 40)     160         activation_19[0][0]              
__________________________________________________________________________________________________
add_4 (Add)                     (None, 8, 8, 40)     0           add_3[0][0]                      
                                                                 batch_normalization_10[0][0]     
__________________________________________________________________________________________________
conv2d_16 (Conv2D)              (None, 4, 4, 80)     28800       add_4[0][0]                      
__________________________________________________________________________________________________
activation_20 (Activation)      (None, 4, 4, 80)     0           conv2d_16[0][0]                  
__________________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 4, 4, 80)     320         activation_20[0][0]              
__________________________________________________________________________________________________
conv2d_17 (Conv2D)              (None, 4, 4, 80)     57600       batch_normalization_11[0][0]     
__________________________________________________________________________________________________
activation_21 (Activation)      (None, 4, 4, 80)     0           conv2d_17[0][0]                  
__________________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 4, 4, 80)     320         activation_21[0][0]              
__________________________________________________________________________________________________
add_5 (Add)                     (None, 4, 4, 80)     0           batch_normalization_11[0][0]     
                                                                 batch_normalization_12[0][0]     
__________________________________________________________________________________________________
conv2d_18 (Conv2D)              (None, 4, 4, 80)     57600       add_5[0][0]                      
__________________________________________________________________________________________________
activation_22 (Activation)      (None, 4, 4, 80)     0           conv2d_18[0][0]                  
__________________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 4, 4, 80)     320         activation_22[0][0]              
__________________________________________________________________________________________________
add_6 (Add)                     (None, 4, 4, 80)     0           add_5[0][0]                      
                                                                 batch_normalization_13[0][0]     
__________________________________________________________________________________________________
conv2d_19 (Conv2D)              (None, 2, 2, 160)    115200      add_6[0][0]                      
__________________________________________________________________________________________________
activation_23 (Activation)      (None, 2, 2, 160)    0           conv2d_19[0][0]                  
__________________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 2, 2, 160)    640         activation_23[0][0]              
__________________________________________________________________________________________________
conv2d_20 (Conv2D)              (None, 2, 2, 160)    230400      batch_normalization_14[0][0]     
__________________________________________________________________________________________________
activation_24 (Activation)      (None, 2, 2, 160)    0           conv2d_20[0][0]                  
__________________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 2, 2, 160)    640         activation_24[0][0]              
__________________________________________________________________________________________________
add_7 (Add)                     (None, 2, 2, 160)    0           batch_normalization_14[0][0]     
                                                                 batch_normalization_15[0][0]     
__________________________________________________________________________________________________
conv2d_21 (Conv2D)              (None, 2, 2, 160)    230400      add_7[0][0]                      
__________________________________________________________________________________________________
activation_25 (Activation)      (None, 2, 2, 160)    0           conv2d_21[0][0]                  
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 2, 2, 160)    640         activation_25[0][0]              
__________________________________________________________________________________________________
add_8 (Add)                     (None, 2, 2, 160)    0           add_7[0][0]                      
                                                                 batch_normalization_16[0][0]     
__________________________________________________________________________________________________
global_max_pooling2d_3 (GlobalM (None, 160)          0           add_8[0][0]                      
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 160)          0           global_max_pooling2d_3[0][0]     
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 10)           1610        dropout_1[0][0]                  
__________________________________________________________________________________________________
activation_26 (Activation)      (None, 10)           0           dense_5[0][0]                    
==================================================================================================
Total params: 771,690
Trainable params: 769,890
Non-trainable params: 1,800
__________________________________________________________________________________________________
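The totals in this summary can be reproduced with a few lines of arithmetic. The sketch below walks the same structure as the model (a 7 x 7 stem, then one strided conv and two residual convs per block, each followed by batch norm) and assumes the usual batch norm bookkeeping of four values per channel, two of which (the moving mean and variance) are not trainable. The function names are my own.

```python
def conv2d_params(kernel_size, in_channels, filters, use_bias):
    weights = kernel_size * kernel_size * in_channels * filters
    return weights + (filters if use_bias else 0)

def count_resnet_params(widths=(10, 20, 40, 80, 160), in_channels=3, n_classes=10):
    """Return (total, non_trainable) parameter counts for the ResNet above."""
    total = conv2d_params(7, in_channels, widths[0], use_bias=True)  # stem conv
    non_trainable = 0
    prev = widths[0]
    for width in widths[1:]:
        # Strided conv, then two residual convs; all are 3x3 without bias
        for c_in in (prev, width, width):
            total += conv2d_params(3, c_in, width, use_bias=False)
            total += 4 * width           # gamma, beta, moving mean, moving variance
            non_trainable += 2 * width   # moving mean and moving variance
        prev = width
    total += prev * n_classes + n_classes  # final dense layer
    return total, non_trainable

total, non_trainable = count_resnet_params()
print(total, total - non_trainable, non_trainable)   # 771690 769890 1800
```

Most of the parameters live in the last block, where each 3 x 3 conv between 160-channel feature maps contributes 230,400 weights on its own.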

The model summary is now getting quite large, but you can still follow through each layer and make sense of what's happening. Let's run this one last time and see what the results look like. We'll increase the epoch count since deeper networks tend to take longer to train.

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=40, validation_data=(x_test, y_test), workers=4)

Epoch 1/40
196/196 [==============================] - 28s 145ms/step - loss: 1.9806 - acc: 0.3498 - val_loss: 7.4266 - val_acc: 0.0771
Epoch 2/40
196/196 [==============================] - 23s 118ms/step - loss: 1.5761 - acc: 0.4484 - val_loss: 2.0037 - val_acc: 0.3478
Epoch 3/40
196/196 [==============================] - 24s 124ms/step - loss: 1.5488 - acc: 0.4612 - val_loss: 14.3443 - val_acc: 0.1005
Epoch 4/40
196/196 [==============================] - 24s 122ms/step - loss: 1.6194 - acc: 0.4359 - val_loss: 2.5182 - val_acc: 0.2401
Epoch 5/40
196/196 [==============================] - 24s 121ms/step - loss: 1.5562 - acc: 0.4626 - val_loss: 2.0495 - val_acc: 0.3302
Epoch 6/40
196/196 [==============================] - 24s 121ms/step - loss: 1.6183 - acc: 0.4400 - val_loss: 2.9989 - val_acc: 0.1782
Epoch 7/40
196/196 [==============================] - 24s 121ms/step - loss: 1.4886 - acc: 0.4672 - val_loss: 1.3995 - val_acc: 0.4944
Epoch 8/40
196/196 [==============================] - 24s 121ms/step - loss: 1.3551 - acc: 0.5162 - val_loss: 1.3086 - val_acc: 0.5268
Epoch 9/40
196/196 [==============================] - 24s 123ms/step - loss: 1.2971 - acc: 0.5373 - val_loss: 1.2979 - val_acc: 0.5423
Epoch 10/40
196/196 [==============================] - 24s 121ms/step - loss: 1.2737 - acc: 0.5507 - val_loss: 8.2801 - val_acc: 0.1325
Epoch 11/40
196/196 [==============================] - 24s 123ms/step - loss: 1.3697 - acc: 0.5350 - val_loss: 1.2361 - val_acc: 0.5742
Epoch 12/40
196/196 [==============================] - 24s 121ms/step - loss: 1.2410 - acc: 0.5652 - val_loss: 1.1365 - val_acc: 0.6007
Epoch 13/40
196/196 [==============================] - 24s 121ms/step - loss: 1.1514 - acc: 0.5958 - val_loss: 1.1343 - val_acc: 0.6118
Epoch 14/40
196/196 [==============================] - 24s 122ms/step - loss: 1.1079 - acc: 0.6096 - val_loss: 1.1276 - val_acc: 0.6092
Epoch 15/40
196/196 [==============================] - 24s 121ms/step - loss: 1.0586 - acc: 0.6306 - val_loss: 1.0696 - val_acc: 0.6330
Epoch 16/40
196/196 [==============================] - 23s 119ms/step - loss: 1.0240 - acc: 0.6437 - val_loss: 1.0270 - val_acc: 0.6596
Epoch 17/40
196/196 [==============================] - 24s 122ms/step - loss: 0.9809 - acc: 0.6611 - val_loss: 1.0828 - val_acc: 0.6391
Epoch 18/40
196/196 [==============================] - 24s 121ms/step - loss: 0.9591 - acc: 0.6685 - val_loss: 0.9332 - val_acc: 0.6848
Epoch 19/40
196/196 [==============================] - 24s 122ms/step - loss: 0.9166 - acc: 0.6860 - val_loss: 0.9894 - val_acc: 0.6632
Epoch 20/40
196/196 [==============================] - 24s 121ms/step - loss: 0.8854 - acc: 0.6983 - val_loss: 1.1848 - val_acc: 0.6169
Epoch 21/40
196/196 [==============================] - 24s 122ms/step - loss: 0.8659 - acc: 0.7045 - val_loss: 0.9105 - val_acc: 0.6978
Epoch 22/40
196/196 [==============================] - 24s 122ms/step - loss: 0.8366 - acc: 0.7162 - val_loss: 0.8779 - val_acc: 0.7132
Epoch 23/40
196/196 [==============================] - 23s 120ms/step - loss: 0.8175 - acc: 0.7252 - val_loss: 1.8874 - val_acc: 0.5708
Epoch 24/40
196/196 [==============================] - 24s 120ms/step - loss: 0.8383 - acc: 0.7203 - val_loss: 0.9611 - val_acc: 0.6878
Epoch 25/40
196/196 [==============================] - 24s 121ms/step - loss: 0.7910 - acc: 0.7360 - val_loss: 0.8956 - val_acc: 0.7037
Epoch 26/40
196/196 [==============================] - 24s 121ms/step - loss: 0.7728 - acc: 0.7445 - val_loss: 0.8712 - val_acc: 0.7297
Epoch 27/40
196/196 [==============================] - 24s 121ms/step - loss: 0.7532 - acc: 0.7514 - val_loss: 0.8697 - val_acc: 0.7191
Epoch 28/40
196/196 [==============================] - 24s 121ms/step - loss: 0.7419 - acc: 0.7568 - val_loss: 0.7995 - val_acc: 0.7405
Epoch 29/40
196/196 [==============================] - 24s 122ms/step - loss: 0.7385 - acc: 0.7599 - val_loss: 0.8080 - val_acc: 0.7451
Epoch 30/40
196/196 [==============================] - 24s 121ms/step - loss: 0.7202 - acc: 0.7663 - val_loss: 0.9121 - val_acc: 0.7253
Epoch 31/40
196/196 [==============================] - 24s 121ms/step - loss: 0.7078 - acc: 0.7737 - val_loss: 0.8999 - val_acc: 0.7223
Epoch 32/40
196/196 [==============================] - 24s 120ms/step - loss: 0.6969 - acc: 0.7756 - val_loss: 0.9682 - val_acc: 0.7135
Epoch 33/40
196/196 [==============================] - 24s 121ms/step - loss: 0.6851 - acc: 0.7825 - val_loss: 0.8145 - val_acc: 0.7456
Epoch 34/40
196/196 [==============================] - 23s 119ms/step - loss: 0.6800 - acc: 0.7859 - val_loss: 0.7972 - val_acc: 0.7585
Epoch 35/40
196/196 [==============================] - 23s 118ms/step - loss: 0.6689 - acc: 0.7919 - val_loss: 0.7807 - val_acc: 0.7654
Epoch 36/40
196/196 [==============================] - 24s 122ms/step - loss: 0.6626 - acc: 0.7949 - val_loss: 0.8022 - val_acc: 0.7509
Epoch 37/40
196/196 [==============================] - 23s 119ms/step - loss: 0.6550 - acc: 0.7987 - val_loss: 0.8129 - val_acc: 0.7613
Epoch 38/40
196/196 [==============================] - 24s 122ms/step - loss: 0.6532 - acc: 0.8006 - val_loss: 0.8861 - val_acc: 0.7359
Epoch 39/40
196/196 [==============================] - 23s 119ms/step - loss: 0.6419 - acc: 0.8043 - val_loss: 0.8233 - val_acc: 0.7568
Epoch 40/40
196/196 [==============================] - 24s 124ms/step - loss: 0.6308 - acc: 0.8109 - val_loss: 0.7809 - val_acc: 0.7670

The results look pretty good. We're starting to hit the point where accuracy improvements are getting harder to come by. It's definitely possible to keep improving the model with the right tuning and augmentation strategies, but diminishing returns start to kick in relative to the effort involved. Also, as the network keeps getting bigger (and as we graduate to larger and more complex data sets), it becomes much harder to build a network from scratch.

Fortunately there's an alternative solution via transfer learning, which takes a model trained on one task and adapts it to another task. Combined with pre-training, which is the practice of using a model that's already been trained for a given task, we can take very large networks developed by the likes of Google and Facebook and then fine-tune them to work in a custom domain of our choosing. Below I'll walk through an example of how this works by using a pre-trained ImageNet model and adapting it to Kaggle's dogs vs cats data set.

First get some imports out of the way. We'll need all of this stuff throughout the exercise.

import numpy as np
from keras.applications import ResNet50
from keras.applications.resnet50 import preprocess_input
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import RMSprop
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator

The easiest way to get the data set is via fast.ai's servers, where they've graciously hosted a single zip file with everything we need. Extract this to a directory somewhere on your machine and update the "PATH" variable below, and you should be good to go. We can also specify a few useful constants such as the image dimension and batch size.

PATH = '/home/paperspace/data/dogscats/'
train_dir = f'{PATH}train'
valid_dir = f'{PATH}valid'
size = 224
batch_size = 64

Next we need a generator to apply transformations to the images. As before, we can use the generator Keras has built-in. The only wrinkle is using a specialized preprocessing function designed for ImageNet-like source data (this also comes with Keras and was imported above).

train_datagen = ImageDataGenerator(
    shear_range=0.2,
    zoom_range=0.2,
    preprocessing_function=preprocess_input,
    horizontal_flip=True)

val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

With CIFAR 10 we had the whole data set loaded into memory, but that strategy usually isn't feasible for larger image databases. In this case we have a bunch of image files in folders on disk as our starting point, and to run a model over these images we want to be able to stream images into memory in batches rather than load everything at once. Fortunately Keras can also handle this scenario natively using the "flow_from_directory" function. We just need to specify the directory, image size, and batch size.

train_generator = train_datagen.flow_from_directory(train_dir,
    target_size=(size, size),
    batch_size=batch_size, class_mode='binary')

val_generator = val_datagen.flow_from_directory(valid_dir,
    shuffle=False,
    target_size=(size, size),
    batch_size=batch_size, class_mode='binary')

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
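
To make the streaming idea concrete, here's a minimal sketch of what a directory-based batch generator does conceptually. It is not the Keras implementation: the `iter_batches` helper and the tiny temporary data set are made up for illustration, and it only yields file paths and labels rather than decoded, augmented images. The one detail it mirrors is that each sub-folder is treated as a class and the sorted folder names become the integer labels.

```python
import os
import tempfile


def iter_batches(root, batch_size):
    """Yield lists of (file_path, label) pairs, one list per batch."""
    # One sub-folder per class; sorted folder names become labels 0, 1, ...
    classes = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, cls in enumerate(classes):
        folder = os.path.join(root, cls)
        for fname in sorted(os.listdir(folder)):
            samples.append((os.path.join(folder, fname), label))
    # Only one slice of the file list is handed out at a time.
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical miniature data set: 3 "cat" files and 2 "dog" files.
with tempfile.TemporaryDirectory() as root:
    for cls, count in [("cats", 3), ("dogs", 2)]:
        os.makedirs(os.path.join(root, cls))
        for i in range(count):
            open(os.path.join(root, cls, f"{i}.jpg"), "w").close()
    batches = list(iter_batches(root, batch_size=2))

batch_sizes = [len(b) for b in batches]
labels = [label for batch in batches for _, label in batch]
print(batch_sizes, labels)
```

A real generator would also shuffle the training files each epoch and load the pixels for each batch on demand, which is where the memory savings come from.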

For the model, we'll use the ResNet-50 architecture with pre-trained weights. ResNet-50 is a 50-layer residual architecture (the Keras implementation contains many more layer objects, since batch norm, activation, and merge layers are each counted separately) that achieved 92% top-5 accuracy on ImageNet classification. Keras provides both the model architecture and an option to use existing weights out of the box. The other notable parameter in the model initializer is "include_top", which indicates if we want to include the fully-connected layer at the top of the network. In our case the answer is no, because we want to "hook into" the model after the last residual block and add our own architecture on top.

base_model = ResNet50(weights='imagenet', include_top=False)
x = base_model.output

After instantiating the pre-trained ResNet-50 model, we can start adding new layers to the architecture. Let's start with a pooling layer to normalize the tensor shape, then add a fully-connected layer of our own. Finally, we'll use a sigmoid unit for class probability since the task is binary (cat or dog).

x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
preds = Dense(1, activation='sigmoid')(x)
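
If the pooling step seems mysterious, here's a minimal NumPy sketch of what global average pooling does to a batch of feature maps. The shape used below (a 7x7 map with 2048 channels, which is what I'd expect the standard ResNet-50 to emit for a 224x224 input) is an assumption, and the values are random stand-ins rather than real activations.

```python
import numpy as np

# Stand-in for the base model output: (batch, height, width, channels).
# Random values; only the shape matters for this illustration.
features = np.random.rand(4, 7, 7, 2048)

# Global average pooling: average each channel over the two spatial axes.
pooled = features.mean(axis=(1, 2))

print(features.shape, "->", pooled.shape)
```

Each image ends up as a single fixed-length vector regardless of the spatial size of the feature map, which is what lets us attach an ordinary dense layer afterwards.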

Before finishing the model definition and compiling, there's one more notable step. We need to prevent the "base" layers of the model from participating in the weight update phase of training while we "break in" the new layers we just added. Since each layer in a Keras model has a "trainable" property, we can just set it to false for all layers in the base architecture.

(Aside: There is apparently some funkiness to using this approach in models that have batch norm layers that can lead to sub-optimal results, especially when doing fine-tuning, which we'll get to in a few steps. I haven't seen a conclusive answer on how to deal with this, and the naive approach seems to work okay for this problem, so I'm not doing anything special to deal with it here, but I wanted to point it out as a potential issue one might run into. There's a lengthy discussion on the subject here).

model = Model(inputs=base_model.input, outputs=preds)
for layer in base_model.layers: layer.trainable = False
model.compile(optimizer=RMSprop(lr=0.001), loss='binary_crossentropy', metrics=['accuracy'])

Training should be pretty familiar. The only wrinkle here is that we need to specify the number of batches in an epoch when using the "flow_from_directory" generator.

history = model.fit_generator(train_generator,
    train_generator.n // batch_size, epochs=3, workers=4,
    validation_data=val_generator,
    validation_steps=val_generator.n // batch_size)

Epoch 1/3
359/359 [==============================] - 128s 357ms/step - loss: 0.1738 - acc: 0.9506 - val_loss: 0.0694 - val_acc: 0.9839
Epoch 2/3
359/359 [==============================] - 123s 342ms/step - loss: 0.0809 - acc: 0.9729 - val_loss: 0.1059 - val_acc: 0.9778
Epoch 3/3
359/359 [==============================] - 123s 344ms/step - loss: 0.0717 - acc: 0.9755 - val_loss: 0.1411 - val_acc: 0.9723
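
As a quick sanity check on the step counts, the arithmetic behind the "359/359" in the log can be worked out directly from the image counts reported by the generators and the batch size of 64. This is just the floor division used in the call above, spelled out.

```python
train_images, val_images, batch_size = 23000, 2000, 64

train_steps = train_images // batch_size   # full batches per training epoch
val_steps = val_images // batch_size       # full batches per validation pass

# Images left over because the final partial batch is not counted.
train_leftover = train_images - train_steps * batch_size
val_leftover = val_images - val_steps * batch_size

print(train_steps, train_leftover)  # 359 24
print(val_steps, val_leftover)      # 31 16
```

Floor division means a handful of images don't fit into a full batch on each pass. If that matters for your evaluation, rounding the step count up instead is one option.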

These results aren't too bad even with the entire base architecture held constant. This is partly due to the fact that the training images are quite similar to the images that the architecture was trained on. If we were fitting the model on something totally different, say medical image classification for instance, transfer learning would still work but it wouldn't be this easy.

The next step is to fine-tune some of the base model by "unfreezing" parts of it and allowing them to update weights during training. I'm not aware of any established best practices for fine-tuning; I think it generally takes a lot of trial and error. For this attempt, I unfroze the last residual block in the network and lowered the learning rate by an order of magnitude.

for layer in model.layers[:142]: layer.trainable = False
for layer in model.layers[142:]: layer.trainable = True
model.compile(optimizer=RMSprop(lr=0.0001), loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit_generator(train_generator,
    train_generator.n // batch_size, epochs=3, workers=4,
    validation_data=val_generator,
    validation_steps=val_generator.n // batch_size)

Epoch 1/3
359/359 [==============================] - 151s 421ms/step - loss: 0.0468 - acc: 0.9826 - val_loss: 1.0175 - val_acc: 0.9098
Epoch 2/3
359/359 [==============================] - 146s 406ms/step - loss: 0.0293 - acc: 0.9903 - val_loss: 0.1305 - val_acc: 0.9829
Epoch 3/3
359/359 [==============================] - 146s 406ms/step - loss: 0.0211 - acc: 0.9938 - val_loss: 0.1197 - val_acc: 0.9849
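
One note on the hard-coded index 142 used in the fine-tuning step: it depends on how a particular Keras version lays out the layers, so it may not point at the same place elsewhere. A less brittle option is to pick the cut point by layer name. Below is a minimal sketch of that idea. The `unfreeze_from` helper and the layer names are hypothetical, and the stand-in objects only mimic the "name" and "trainable" attributes of real layers. With a real model you'd want to inspect `[l.name for l in model.layers]` first to find the right prefix.

```python
from types import SimpleNamespace


def unfreeze_from(layers, name_prefix):
    """Freeze every layer before the first one whose name starts with
    name_prefix; make that layer and all later ones trainable.
    Returns the index of the cut point (raises StopIteration if no match)."""
    cut = next(i for i, layer in enumerate(layers)
               if layer.name.startswith(name_prefix))
    for i, layer in enumerate(layers):
        layer.trainable = i >= cut
    return cut


# Hypothetical stand-ins for Keras layers; the names are invented.
layers = [SimpleNamespace(name=n, trainable=False)
          for n in ["stem_conv", "block4_conv", "block5_conv", "block5_bn",
                    "new_pool", "new_dense"]]

cut = unfreeze_from(layers, "block5")
print(cut, [layer.trainable for layer in layers])
```

As with the original code, the model needs to be compiled again after changing the "trainable" flags for the change to take effect.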

This technique is very powerful and is probably almost always a better idea than starting from scratch if there's a model out there that is at least somewhat similar to the thing you're trying to accomplish. Currently transfer learning is mostly being applied to image models, although it's quickly taking over language models as well.

That wraps up this post on convolutional networks. In the next post in this series we'll see how to use a deep learning framework like Keras to build a recommendation system. Don't miss it!

Deep Learning With Keras: Structured Time Series

John Wittenauer — Sun, 14 Oct 2018 18:00:41 GMT

This post marks the beginning of what I hope to become a series covering practical, real-world implementations using deep learning. What sparked my motivation to do a series like this was Jeremy Howard's awesome fast.ai courses, which show how to use deep learning to achieve world class performance from scratch in a number of different domains. The one quibble I had with the class content was its use of a custom wrapper library, which I felt masked a lot of the difficulty early on by hiding the complexity in a function (or several layers of functions). I understand that this approach was used deliberately as a teaching mechanism, and that's fine, but I wanted to see what these exercises looked like without having that crutch. I also have a bit more experience with Keras than with PyTorch, and while both are great libraries, my preference at the moment is still Keras for most tasks.

My plan for each installment of the series is to take a topic from Jeremy's class and see if I can achieve similar results using nothing but Keras and other common libraries. We won't get into the theory or math of neural networks much; there are lots of great (free) resources on the internet now to cover that aspect of it if that's what you're looking for. Instead, this series will focus heavily on writing code. I won't gloss over important pre-processing steps or assume tricky details are taken care of behind the scenes. A lot of the challenge of using neural networks in practice resides in many of these often-neglected details, after all. By the end of each post, we should have a sequence of steps that you can reliably run yourself (assuming dependencies are met and API versions are the same) to produce a result that's similar to what I show here. With that, let's dive in!

Today we'll walk through an implementation of a deep learning model for structured time series data. We’ll use the data from Kaggle’s Rossmann Store Sales competition. The steps outlined below are inspired by (and partially based on) lesson 3 from Jeremy's course.

The focus here is on implementing a deep learning model for structured data. I’ve skipped a bunch of pre-processing steps that are specific to this particular data but don’t reflect general principles about applying deep learning to tabular data. If you’re interested, you’ll find complete step-by-step instructions on creating the “joined” data in the notebook I linked to above. I know I just got done saying that I won't skip pre-processing steps, but I'm mainly talking about stuff we apply to the data to work with deep learning, not sourcing the data to begin with.

(As an aside, I used Paperspace to run everything in this post. If you’re not familiar with it, Paperspace is a cloud service that lets you rent GPU instances much cheaper than AWS. It’s a great way to get started if you don’t have your own hardware.)

First we need to get a few imports out of the way. All of these should come standard with an Anaconda install. I’m also specifying the path where I’ve pre-saved the “joined” data that we’ll use as a starting point. If you're starting from scratch, this comes from running the first half of Jeremy's lesson 3 notebook.

%matplotlib inline
import datetime
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

PATH = '/home/paperspace/data/rossmann/'

Read the data file into a pandas dataframe and take a peek at the data to see what we’re working with.

data = pd.read_feather(f'{PATH}joined')
data.shape

(844338, 93)

We can also take a look at the first couple rows in the data with this trick that transposes each row into a column (93 is the number of columns).

data.head().T.head(93)

The data consists of ~800,000 records with a variety of features used to predict sales at a given store on a given day. As mentioned before, we’re skipping over details about where these features came from as it’s not the focus of this notebook, but you can find more information through the links above. Next we’ll define variables that group the features into continuous and categorical buckets. This is very important as neural networks (really anything other than tree models) do not natively handle categorical data well.

target = 'Sales'
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday',
            'CompetitionMonthsOpen', 'Promo2Weeks', 'StoreType', 'Assortment',
            'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
            'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw',
            'StateHoliday_bw', 'SchoolHoliday_fw', 'SchoolHoliday_bw']
cont_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC',
             'Min_TemperatureC', 'Max_Humidity', 'Mean_Humidity', 'Min_Humidity',
             'Max_Wind_SpeedKm_h', 'Mean_Wind_SpeedKm_h', 'CloudCover', 'trend',
             'trend_DE', 'AfterStateHoliday', 'BeforeStateHoliday', 'Promo',
             'SchoolHoliday']

Set some reasonable default values for missing information so our pre-processing steps won’t fail.

data = data.set_index('Date')
data[cat_vars] = data[cat_vars].fillna(value='')
data[cont_vars] = data[cont_vars].fillna(value=0)

Now we can do something with the categorical variables. The simplest first step is to use scikit-learn’s LabelEncoder class to transform the raw category values (many of which are plain text) into unique integers, where each integer maps to a distinct value in that category. The code block below saves the fitted encoders (we’ll need them later) and prints out the unique labels that each encoder found.

encoders = {}
for v in cat_vars:
    le = LabelEncoder()
    le.fit(data[v].values)
    encoders[v] = le
    data.loc[:, v] = le.transform(data[v].values)
    print('{0}: {1}'.format(v, le.classes_))

Store: [   1    2    3 ... 1113 1114 1115]
DayOfWeek: [1 2 3 4 5 6 7]
Year: [2013 2014 2015]
Month: [ 1  2  3  4  5  6  7  8  9 10 11 12]
Day: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31]
StateHoliday: [False  True]
CompetitionMonthsOpen: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24]
Promo2Weeks: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25]
StoreType: ['a' 'b' 'c' 'd']
Assortment: ['a' 'b' 'c']
PromoInterval: ['' 'Feb,May,Aug,Nov' 'Jan,Apr,Jul,Oct' 'Mar,Jun,Sept,Dec']
CompetitionOpenSinceYear: [1900 1961 1990 1994 1995 1998 1999 2000 2001 2002 2003 2004 2005 2006
 2007 2008 2009 2010 2011 2012 2013 2014 2015]
Promo2SinceYear: [1900 2009 2010 2011 2012 2013 2014 2015]
State: ['BE' 'BW' 'BY' 'HB,NI' 'HE' 'HH' 'NW' 'RP' 'SH' 'SN' 'ST' 'TH']
Week: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52]
Events: ['' 'Fog' 'Fog-Rain' 'Fog-Rain-Hail' 'Fog-Rain-Hail-Thunderstorm'
 'Fog-Rain-Snow' 'Fog-Rain-Snow-Hail' 'Fog-Rain-Thunderstorm' 'Fog-Snow'
 'Fog-Snow-Hail' 'Fog-Thunderstorm' 'Rain' 'Rain-Hail'
 'Rain-Hail-Thunderstorm' 'Rain-Snow' 'Rain-Snow-Hail'
 'Rain-Snow-Hail-Thunderstorm' 'Rain-Snow-Thunderstorm'
 'Rain-Thunderstorm' 'Snow' 'Snow-Hail' 'Thunderstorm']
Promo_fw: [0. 1. 2. 3. 4. 5.]
Promo_bw: [0. 1. 2. 3. 4. 5.]
StateHoliday_fw: [0. 1. 2.]
StateHoliday_bw: [0. 1. 2.]
SchoolHoliday_fw: [0. 1. 2. 3. 4. 5. 6. 7.]
SchoolHoliday_bw: [0. 1. 2. 3. 4. 5. 6. 7.]
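
For anyone curious what the encoder is doing under the hood, here's a minimal plain-Python sketch of the same idea: collect the sorted unique values, then map each value to its position in that list. The `fit_label_encoder` helper and the toy column are made up for illustration; the real class also handles NumPy arrays and unseen-label errors.

```python
def fit_label_encoder(values):
    """Return (classes, mapping): sorted unique values and value -> integer code."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return classes, mapping


store_types = ['c', 'a', 'a', 'd', 'b', 'a']   # toy column, not the real data
classes, mapping = fit_label_encoder(store_types)
encoded = [mapping[v] for v in store_types]

# Keeping the classes around lets us translate codes back to labels later.
decoded = [classes[code] for code in encoded]

print(classes)   # ['a', 'b', 'c', 'd']
print(encoded)   # [2, 0, 0, 3, 1, 0]
```

This is also why the fitted encoders are worth saving: the integer codes are meaningless without the list of classes that produced them.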

Split the data set into training and validation sets. To preserve the temporal nature of the data and make sure that we don’t have any information leaks, we’ll just take everything past a certain date and use that as our validation set.

train = data[data.index < datetime.datetime(2015, 7, 1)]
val = data[data.index >= datetime.datetime(2015, 7, 1)]

X = train[cat_vars + cont_vars].copy()
X_val = val[cat_vars + cont_vars].copy()
y = train[target].copy()
y_val = val[target].copy()
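
The property we care about in a time-based split is simple to state: every training date must come before every validation date. Here's a small sketch that checks it on a handful of made-up rows (the dates and sales figures are invented; only the cutoff matches the one used above).

```python
import datetime

cutoff = datetime.date(2015, 7, 1)

# Toy (date, sales) rows standing in for the real data frame.
rows = [
    (datetime.date(2015, 6, 29), 5000),
    (datetime.date(2015, 6, 30), 6100),
    (datetime.date(2015, 7, 1), 5500),
    (datetime.date(2015, 7, 2), 4800),
]

train_rows = [r for r in rows if r[0] < cutoff]
val_rows = [r for r in rows if r[0] >= cutoff]

# No leakage: the latest training date precedes the earliest validation date.
leak_free = max(d for d, _ in train_rows) < min(d for d, _ in val_rows)
print(len(train_rows), len(val_rows), leak_free)
```

A random split would not pass this check, since future days would end up mixed into the training set.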

Next we can apply scaling to our continuous variables. We can once again leverage scikit-learn and use the StandardScaler class for this. The proper way to apply scaling is to “fit” the scaler on the training data and then apply the same transformation to both the training and validation data (this is why we had to split the data set in the last step).

scaler = StandardScaler()
X.loc[:, cont_vars] = scaler.fit_transform(X[cont_vars].values)
X_val.loc[:, cont_vars] = scaler.transform(X_val[cont_vars].values)
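
Here's a minimal sketch of the fit-on-train, apply-to-both pattern in plain Python, in case the distinction between "fit_transform" and "transform" isn't obvious. The helper names and numbers are invented. The sketch uses the population standard deviation, which is what StandardScaler uses as well.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously learned statistics."""
    return [(v - mean) / std for v in values]


train_dist = [100.0, 200.0, 300.0, 400.0]   # toy numbers
val_dist = [250.0, 500.0]

mean, std = fit_scaler(train_dist)           # statistics come from train only
train_scaled = transform(train_dist, mean, std)
val_scaled = transform(val_dist, mean, std)  # reuse them, do not refit

print(mean, round(std, 3))
print([round(v, 3) for v in val_scaled])
```

The validation values are not guaranteed to have zero mean or unit variance after scaling, and that's fine. The point is that the validation set never influences the statistics the model is trained against.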

Normalize the data types that each variable is stored as. This is not strictly necessary but helps save storage space (and potentially processing time, although I’m less sure about that).

for v in cat_vars:
    X[v] = X[v].astype('int').astype('category').cat.as_ordered()
    X_val[v] = X_val[v].astype('int').astype('category').cat.as_ordered()
for v in cont_vars:
    X[v] = X[v].astype('float32')
    X_val[v] = X_val[v].astype('float32')

Let’s take a look at where we’re at. The data should basically be ready to move into the modeling phase.

X.shape, X_val.shape, y.shape, y_val.shape

((814150, 38), (30188, 38), (814150,), (30188,))

X.dtypes

Store                       category
DayOfWeek                   category
Year                        category
Month                       category
Day                         category
StateHoliday                category
CompetitionMonthsOpen       category
Promo2Weeks                 category
StoreType                   category
Assortment                  category
PromoInterval               category
CompetitionOpenSinceYear    category
Promo2SinceYear             category
State                       category
Week                        category
Events                      category
Promo_fw                    category
Promo_bw                    category
StateHoliday_fw             category
StateHoliday_bw             category
SchoolHoliday_fw            category
SchoolHoliday_bw            category
CompetitionDistance          float32
Max_TemperatureC             float32
Mean_TemperatureC            float32
Min_TemperatureC             float32
Max_Humidity                 float32
Mean_Humidity                float32
Min_Humidity                 float32
Max_Wind_SpeedKm_h           float32
Mean_Wind_SpeedKm_h          float32
CloudCover                   float32
trend                        float32
trend_DE                     float32
AfterStateHoliday            float32
BeforeStateHoliday           float32
Promo                        float32
SchoolHoliday                float32

We now basically have two options when it comes to handling of categorical variables. The first option, which is the “traditional” way of handling categories, is to do a one-hot encoding for each category. This approach would create a binary variable for each unique value in each category, with the value being a 1 for the “correct” category and 0 for everything else. One-hot encoding works fairly well and is quite easy to do (there’s even a scikit-learn class for it), however it’s not perfect. It’s particularly challenging with high-cardinality variables because it creates a very large, very sparse array that’s hard to learn from.

Fortunately there’s a better way, which is something called entity embeddings or category embeddings (I don’t think there’s a standard name for this yet). Jeremy covers it extensively in the class (also this blog post explains it very well). The basic idea is to create a distributed representation of the category using a vector of continuous numbers, where the length of the vector is lower than the cardinality of the category. The key insight is that this vector is learned by the network. It’s part of the optimization graph. This allows the network to model complex, non-linear interactions between categories and other features in your input. It’s quite useful, and as we’ll see at the end, these embeddings can be used in interesting ways outside of the neural network itself.
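
To demystify this a bit, here's a minimal NumPy sketch of what an embedding amounts to mechanically: a matrix with one row per category value, where looking up a category just selects its row. The sketch also shows that this gives the same result as one-hot encoding followed by a matrix multiplication, which is one way to see why embeddings are a compact replacement for one-hot inputs. The matrix values are random here; in a real model the optimizer would adjust them during training.

```python
import numpy as np

cardinality, emb_dim = 7, 4          # e.g. DayOfWeek -> 4-dimensional vector
rng = np.random.default_rng(0)

# The embedding is a (cardinality x emb_dim) matrix of trainable weights.
emb_matrix = rng.normal(size=(cardinality, emb_dim))

codes = np.array([0, 3, 6, 3])       # integer-encoded categories

# Lookup: pick one row per code.
looked_up = emb_matrix[codes]

# Same result via one-hot encoding followed by a matrix product.
one_hot = np.eye(cardinality)[codes]
via_one_hot = one_hot @ emb_matrix

print(looked_up.shape, np.allclose(looked_up, via_one_hot))
```

The lookup form avoids ever materializing the sparse one-hot array, which matters a lot for something like the store ID with over a thousand distinct values.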

In order to build a model using embeddings, we need to do some more prep work on our categories. First, let’s create a list of category names along with their cardinality.

cat_sizes = [(c, len(X[c].cat.categories)) for c in cat_vars]
cat_sizes

[('Store', 1115),
 ('DayOfWeek', 7),
 ('Year', 3),
 ('Month', 12),
 ('Day', 31),
 ('StateHoliday', 2),
 ('CompetitionMonthsOpen', 25),
 ('Promo2Weeks', 26),
 ('StoreType', 4),
 ('Assortment', 3),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 23),
 ('Promo2SinceYear', 8),
 ('State', 12),
 ('Week', 52),
 ('Events', 22),
 ('Promo_fw', 6),
 ('Promo_bw', 6),
 ('StateHoliday_fw', 3),
 ('StateHoliday_bw', 3),
 ('SchoolHoliday_fw', 8),
 ('SchoolHoliday_bw', 8)]

Now we need to decide on the length of each embedding vector. Jeremy proposed using a simple formula: cardinality / 2 (rounded up), with a max of 50.

embedding_sizes = [(c, min(50, (c + 1) // 2)) for _, c in cat_sizes]
embedding_sizes

[(1115, 50),
 (7, 4),
 (3, 2),
 (12, 6),
 (31, 16),
 (2, 1),
 (25, 13),
 (26, 13),
 (4, 2),
 (3, 2),
 (4, 2),
 (23, 12),
 (8, 4),
 (12, 6),
 (52, 26),
 (22, 11),
 (6, 3),
 (6, 3),
 (3, 2),
 (3, 2),
 (8, 4),
 (8, 4)]
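
It's worth pausing on what these sizes buy us. The sketch below recomputes the embedding sizes from the cardinalities listed above and compares the total width of the concatenated embeddings to what a full one-hot encoding of the same categories would need. It also adds the 16 continuous variables to get the width of the input to the first dense layer.

```python
cardinalities = [1115, 7, 3, 12, 31, 2, 25, 26, 4, 3, 4,
                 23, 8, 12, 52, 22, 6, 6, 3, 3, 8, 8]
n_continuous = 16

# Same rule as above: half the cardinality, rounded up, capped at 50.
embedding_dims = [min(50, (c + 1) // 2) for c in cardinalities]

one_hot_width = sum(cardinalities)        # columns if every category were one-hot
embedding_width = sum(embedding_dims)     # columns after concatenating embeddings
dense_input_width = embedding_width + n_continuous

print(one_hot_width, embedding_width, dense_input_width)  # 1383 188 204
```

So the categorical part of the input shrinks from 1,383 sparse columns to 188 dense ones, most of the savings coming from the store ID.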

One last pre-processing step. Keras requires that each “input” into the model be fed in as a separate array, and since each embedding has its own input, we need to do some transformations to get the data in the right format.

X_array = []
X_val_array = []

for i, v in enumerate(cat_vars):
    X_array.append(X.iloc[:, i])
    X_val_array.append(X_val.iloc[:, i])

X_array.append(X.iloc[:, len(cat_vars):])
X_val_array.append(X_val.iloc[:, len(cat_vars):])

len(X_array), len(X_val_array)

(23, 23)

Okay! We’re finally ready to get to the modeling part. Let’s get some imports out of the way. I’ve also defined a custom metric to calculate root mean squared percentage error, which was originally used by the Kaggle competition to score this data set.

from keras import backend as K
from keras import regularizers
from keras.models import Sequential
from keras.models import Model
from keras.layers import Activation, BatchNormalization, Concatenate
from keras.layers import Dropout, Dense, Input, Reshape
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

def rmspe(y_true, y_pred):
    pct_var = (y_true - y_pred) / y_true
    return K.sqrt(K.mean(K.square(pct_var)))
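
For checking results outside of Keras, here's a plain-Python sketch of the same metric. The `rmspe_plain` name is made up, and skipping rows where the true value is zero is an assumption on my part to avoid dividing by zero; the Keras version above does no such filtering.

```python
import math


def rmspe_plain(y_true, y_pred):
    """Root mean squared percentage error over pairs with non-zero truth."""
    pct_errors = [(t - p) / t for t, p in zip(y_true, y_pred) if t != 0]
    return math.sqrt(sum(e * e for e in pct_errors) / len(pct_errors))


# Both predictions are off by 10% of the true value.
score = rmspe_plain([100.0, 200.0], [110.0, 180.0])
print(round(score, 4))  # 0.1
```

One thing to keep in mind is that Keras averages a metric function over batches, so the number reported during training is a mean of per-batch values and may differ slightly from the metric computed over the whole set at once.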

Now for the model itself. I tried to make this as similar to Jeremy’s model as I could, although there are some slight differences. The “for” section at the top shows how to add embeddings. They then get concatenated together and we apply dropout to the unified embedding layer. Next we concatenate the output of that layer with our continuous inputs and feed the whole thing into a dense layer. From here on it’s pretty standard stuff. The only notable design choice is I omitted batch normalization because it seemed to hurt performance no matter what I did. I also increased dropout a bit from what Jeremy had in his PyTorch architecture for this data. Finally, note the inclusion of the “rmspe” function as a metric during the compile step (this will show up later during training).

def EmbeddingNet(cat_vars, cont_vars, embedding_sizes):
    inputs = []
    embed_layers = []
    for (c, (in_size, out_size)) in zip(cat_vars, embedding_sizes):
        i = Input(shape=(1,))
        o = Embedding(in_size, out_size, name=c)(i)
        o = Reshape(target_shape=(out_size,))(o)
        inputs.append(i)
        embed_layers.append(o)

    embed = Concatenate()(embed_layers)
    embed = Dropout(0.04)(embed)

    cont_input = Input(shape=(len(cont_vars),))
    inputs.append(cont_input)

    x = Concatenate()([embed, cont_input])

    x = Dense(1000, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.1)(x)

    x = Dense(500, kernel_initializer='he_normal')(x)
    x = Activation('relu')(x)
    x = Dropout(0.1)(x)

    x = Dense(1, kernel_initializer='he_normal')(x)
    x = Activation('linear')(x)

    model = Model(inputs=inputs, outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='mean_absolute_error', optimizer=opt, metrics=[rmspe])

    return model

One of the cool tricks Jeremy introduced in the class was the concept of a learning rate finder. The idea is to start with a very small learning rate and slowly increase it throughout the epoch, and monitor the loss along the way. It should end up as a curve that gives a good indication of where to set the learning rate for training. To accomplish this with Keras, I found a script on GitHub that implements learning rate cycling and includes a class that’s supposed to mimic Jeremy’s LR finder. We can just download a copy to the local directory.

!wget "https://raw.githubusercontent.com/titu1994/keras-one-cycle/master/clr.py"

--2018-10-04 20:20:17--  https://raw.githubusercontent.com/titu1994/keras-one-cycle/master/clr.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.200.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.200.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22310 (22K) [text/plain]
Saving to: ‘clr.py’

clr.py              100%[===================>]  21.79K  --.-KB/s    in 0.009s  

2018-10-04 20:20:17 (2.35 MB/s) - ‘clr.py’ saved [22310/22310]

Let’s set up and train the model for one epoch using the LRFinder class as a callback. It will slowly but exponentially increase the learning rate each batch and track the loss so we can plot the results.

from clr import LRFinder

lr_finder = LRFinder(num_samples=X.shape[0], batch_size=128, minimum_lr=1e-5,
                     maximum_lr=10, lr_scale='exp', loss_smoothing_beta=0.995,
                     verbose=False)
model = EmbeddingNet(cat_vars, cont_vars, embedding_sizes)
history = model.fit(x=X_array, y=y, batch_size=128, epochs=1, verbose=1,
                    callbacks=[lr_finder], validation_data=(X_val_array, y_val),
                    shuffle=False)

Train on 814150 samples, validate on 30188 samples
Epoch 1/1
814150/814150 [==============================] - 73s 90us/step - loss: 2521.7429 - rmspe: 0.4402 - val_loss: 3441.1762 - val_rmspe: 0.5088

lr_finder.plot_schedule(clip_beginning=20)
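
To get a feel for the sweep itself, here's a generic sketch of an exponential learning rate schedule using the same sample count, batch size, and bounds as the call above. This is my own illustration of the idea and not the code from the downloaded script, which may compute its steps differently. The `lr_at` helper is made up.

```python
import math

num_samples, batch_size = 814150, 128
min_lr, max_lr = 1e-5, 10.0

# One learning rate per batch in the epoch (last batch is partial).
num_batches = math.ceil(num_samples / batch_size)


def lr_at(batch_index):
    """Learning rate for a batch, spaced evenly on a log scale."""
    fraction = batch_index / (num_batches - 1)
    return min_lr * (max_lr / min_lr) ** fraction


schedule = [lr_at(i) for i in range(num_batches)]
print(num_batches)                      # 6361
print(schedule[0], schedule[num_batches // 2])
```

Because the steps are even on a log scale, half of the epoch is spent below roughly 1e-2, which is why the interesting part of the plot tends to sit toward the left.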

It doesn’t look as good as the plot Jeremy used in the class. The PyTorch version seemed to make it much more apparent where the loss started to level off. I haven’t dug into this too closely but I’m guessing there are some "tricks" in that version that we aren't using. If I had to eyeball this I’d say it’s recommending 1e-4 for the learning rate, but Jeremy used 1e-3 so we’ll go with that instead.

We’re now ready to train the model. I’ve included two callbacks (both built into Keras) to demonstrate how they work. The first one automatically reduces the learning rate as we progress through training if the validation error stops improving. The second one will save a copy of the model weights to a file every time we reach a new low in validation error.

model = EmbeddingNet(cat_vars, cont_vars, embedding_sizes)
lr_reducer = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3,
                               verbose=1, mode='auto', min_delta=10, cooldown=0,
                               min_lr=0.0001)
checkpoint = ModelCheckpoint('best_model_weights.hdf5', monitor='val_loss',
                             save_best_only=True)
history = model.fit(x=X_array, y=y, batch_size=128, epochs=20, verbose=1,
                    callbacks=[lr_reducer, checkpoint],
                    validation_data=(X_val_array, y_val), shuffle=False)
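
Before looking at the output, here's a simplified sketch of the plateau logic so the learning rate message in the log makes sense. It's my own re-implementation of the idea, not the Keras source, and it ignores the cooldown option. The validation losses are the first six values from the training log that follows, and the other settings match the callback above.

```python
def simulate_plateau(val_losses, lr, factor, patience, min_delta, min_lr):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:      # a real improvement resets the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# First six validation losses copied from the training log.
losses = [1923.3162, 1895.0041, 1551.5644, 1589.6841, 2032.6661, 1559.3813]
lrs = simulate_plateau(losses, lr=0.001, factor=0.2, patience=3,
                       min_delta=10, min_lr=0.0001)
print(lrs)
```

The best loss is set in epoch 3, the next three epochs fail to beat it by at least 10, and so the rate drops from 0.001 to 0.0002 at the end of epoch 6.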

Train on 814150 samples, validate on 30188 samples
Epoch 1/20
814150/814150 [==============================] - 68s 83us/step - loss: 1138.6056 - rmspe: 0.2421 - val_loss: 1923.3162 - val_rmspe: 0.3177
Epoch 2/20
814150/814150 [==============================] - 66s 81us/step - loss: 962.1155 - rmspe: 0.2140 - val_loss: 1895.0041 - val_rmspe: 0.3015
Epoch 3/20
814150/814150 [==============================] - 66s 80us/step - loss: 850.5718 - rmspe: 0.1899 - val_loss: 1551.5644 - val_rmspe: 0.2554
Epoch 4/20
814150/814150 [==============================] - 66s 81us/step - loss: 760.7246 - rmspe: 0.1607 - val_loss: 1589.6841 - val_rmspe: 0.2556
Epoch 5/20
814150/814150 [==============================] - 66s 81us/step - loss: 723.1884 - rmspe: 0.1522 - val_loss: 2032.6661 - val_rmspe: 0.3093
Epoch 6/20
814150/814150 [==============================] - 66s 81us/step - loss: 701.6135 - rmspe: 0.1470 - val_loss: 1559.3813 - val_rmspe: 0.2455

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
Epoch 7/20
814150/814150 [==============================] - 66s 81us/step - loss: 759.7100 - rmspe: 0.1551 - val_loss: 1363.9912 - val_rmspe: 0.2134
Epoch 8/20
814150/814150 [==============================] - 66s 81us/step - loss: 687.3188 - rmspe: 0.1445 - val_loss: 1238.6456 - val_rmspe: 0.1987
Epoch 9/20
814150/814150 [==============================] - 66s 82us/step - loss: 664.9696 - rmspe: 0.1411 - val_loss: 1156.7629 - val_rmspe: 0.1894
Epoch 10/20
814150/814150 [==============================] - 66s 81us/step - loss: 648.3002 - rmspe: 0.1383 - val_loss: 1085.9985 - val_rmspe: 0.1804
Epoch 11/20
814150/814150 [==============================] - 66s 81us/step - loss: 634.7324 - rmspe: 0.1358 - val_loss: 1046.5626 - val_rmspe: 0.1764
Epoch 12/20
814150/814150 [==============================] - 66s 81us/step - loss: 620.5305 - rmspe: 0.1331 - val_loss: 998.0284 - val_rmspe: 0.1702
Epoch 13/20
814150/814150 [==============================] - 66s 80us/step - loss: 608.7635 - rmspe: 0.1308 - val_loss: 972.2079 - val_rmspe: 0.1672
Epoch 14/20
814150/814150 [==============================] - 66s 81us/step - loss: 596.7082 - rmspe: 0.1287 - val_loss: 944.8604 - val_rmspe: 0.1627
Epoch 15/20
814150/814150 [==============================] - 66s 81us/step - loss: 585.2907 - rmspe: 0.1265 - val_loss: 902.0995 - val_rmspe: 0.1568
Epoch 16/20
814150/814150 [==============================] - 66s 81us/step - loss: 575.5892 - rmspe: 0.1246 - val_loss: 854.3993 - val_rmspe: 0.1492
Epoch 17/20
814150/814150 [==============================] - 66s 81us/step - loss: 566.3440 - rmspe: 0.1228 - val_loss: 817.1876 - val_rmspe: 0.1438
Epoch 18/20
814150/814150 [==============================] - 66s 81us/step - loss: 558.5853 - rmspe: 0.1214 - val_loss: 767.2299 - val_rmspe: 0.1369
Epoch 19/20
814150/814150 [==============================] - 66s 81us/step - loss: 550.4629 - rmspe: 0.1200 - val_loss: 730.3196 - val_rmspe: 0.1317
Epoch 20/20
814150/814150 [==============================] - 66s 81us/step - loss: 542.9558 - rmspe: 0.1188 - val_loss: 698.6143 - val_rmspe: 0.1278

By the end it’s doing pretty well, and it looks like the model is still improving. We can quickly get a snapshot of its performance using the “history” object that Keras’s "fit" method returns.

loss_history = history.history['loss']
val_loss_history = history.history['val_loss']
min_val_epoch = val_loss_history.index(min(val_loss_history)) + 1

print('min training loss = {0}'.format(min(loss_history)))
print('min val loss = {0}'.format(min(val_loss_history)))
print('min val epoch = {0}'.format(min_val_epoch))

min training loss = 542.9558401937004
min val loss = 698.6142525395542
min val epoch = 20

I also like to make plots to visually see what’s going on. Let’s create a function that plots the training and validation loss history.

from jupyterthemes import jtplot
jtplot.style()

def plot_loss_history(history, n_epochs):
    fig, ax = plt.subplots(figsize=(8, 8 * 3 / 4))
    ax.plot(list(range(n_epochs)), history.history['loss'], label='Training Loss')
    ax.plot(list(range(n_epochs)), history.history['val_loss'], label='Validation Loss')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.legend(loc='upper right')
    fig.tight_layout()

plot_loss_history(history, 20)

The validation loss was pretty unstable early on but was really starting to converge toward the end of training. We can do something similar for the learning rate history.

def plot_learning_rate(history):
    fig, ax = plt.subplots(figsize=(8, 8 * 3 / 4))
    ax.set_xlabel('Training Iterations')
    ax.set_ylabel('Learning Rate')
    ax.plot(history.history['lr'])
    fig.tight_layout()

plot_learning_rate(history)
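To make the plateau logic a bit less of a black box, here's a rough sketch of what the learning rate reducer is doing. This is a simplified re-implementation for illustration (it ignores the cooldown option and is not the actual Keras source), and the function name is my own. It assumes training started at 1e-3, the rate chosen earlier, and replays the validation losses from the first eight epochs of the log above.

```python
def simulate_reduce_on_plateau(val_losses, initial_lr, factor=0.2, patience=3,
                               min_delta=10, min_lr=0.0001):
    """Return the learning rate in effect after each epoch.

    Simplified sketch of reduce-on-plateau (no cooldown handling);
    not the actual Keras implementation.
    """
    lr = initial_lr
    best = float('inf')
    wait = 0
    lrs = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Meaningful improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the rate, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        lrs.append(lr)
    return lrs

# Validation losses copied from the first eight epochs of the training log
val_losses = [1923.3162, 1895.0041, 1551.5644, 1589.6841,
              2032.6661, 1559.3813, 1363.9912, 1238.6456]

# Assumption: the optimizer started at a learning rate of 1e-3
for epoch, lr in enumerate(simulate_reduce_on_plateau(val_losses, 1e-3), start=1):
    print(epoch, lr)
```

Epochs 4 through 6 all fail to beat the epoch 3 loss by more than min_delta, so the sketch drops the rate by a factor of 0.2 after epoch 6, which lines up with the message in the log.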

One other innovation Jeremy introduced in the class is the idea of using learning rate cycles to help prevent the model from settling in a bad local minimum. This is based on research by Leslie Smith that showed using this type of learning rate policy can lead to quicker convergence and better accuracy (this is also where the learning rate finder idea came from). Fortunately the file we downloaded earlier includes support for cyclical learning rates in Keras, so we can try this out ourselves. The policy Jeremy is currently recommending is called a “one-cycle” policy so that’s what we’ll try.

(As an aside, Jeremy wrote a blog post about this if you'd like to dig into its origins a bit more. His results applying it to ImageNet were quite impressive.)

model2 = EmbeddingNet(cat_vars, cont_vars, embedding_sizes)
batch_size = 128
n_epochs = 10
lr_manager = OneCycleLR(num_samples=X.shape[0] + batch_size, num_epochs=n_epochs,
                        batch_size=batch_size, max_lr=0.01, end_percentage=0.1,
                        scale_percentage=None, maximum_momentum=None,
                        minimum_momentum=None, verbose=False)
history = model2.fit(x=X_array, y=y, batch_size=batch_size, epochs=n_epochs,
                     verbose=1, callbacks=[checkpoint, lr_manager],
                     validation_data=(X_val_array, y_val), shuffle=False)

Train on 814150 samples, validate on 30188 samples
Epoch 1/10
814150/814150 [==============================] - 76s 93us/step - loss: 1115.8234 - rmspe: 0.2384 - val_loss: 1625.4826 - val_rmspe: 0.2847
Epoch 2/10
814150/814150 [==============================] - 74s 90us/step - loss: 853.5083 - rmspe: 0.1828 - val_loss: 1308.4618 - val_rmspe: 0.2416
Epoch 3/10
814150/814150 [==============================] - 73s 90us/step - loss: 800.1833 - rmspe: 0.1622 - val_loss: 1379.4527 - val_rmspe: 0.2425
Epoch 4/10
814150/814150 [==============================] - 74s 91us/step - loss: 820.6853 - rmspe: 0.1627 - val_loss: 1353.2198 - val_rmspe: 0.2386
Epoch 5/10
814150/814150 [==============================] - 73s 90us/step - loss: 823.7708 - rmspe: 0.1641 - val_loss: 1423.9368 - val_rmspe: 0.2440
Epoch 6/10
814150/814150 [==============================] - 74s 90us/step - loss: 778.9107 - rmspe: 0.1548 - val_loss: 1425.7734 - val_rmspe: 0.2449
Epoch 7/10
814150/814150 [==============================] - 73s 90us/step - loss: 760.5194 - rmspe: 0.1508 - val_loss: 1324.7112 - val_rmspe: 0.2273
Epoch 8/10
814150/814150 [==============================] - 74s 91us/step - loss: 734.5933 - rmspe: 0.1464 - val_loss: 1449.1921 - val_rmspe: 0.2401
Epoch 9/10
814150/814150 [==============================] - 74s 91us/step - loss: 750.8221 - rmspe: 0.1491 - val_loss: 2127.6987 - val_rmspe: 0.3179
Epoch 10/10
814150/814150 [==============================] - 74s 91us/step - loss: 750.6736 - rmspe: 0.1500 - val_loss: 1375.3424 - val_rmspe: 0.2121

As you can probably tell from the model error, I didn’t have a lot of success with this strategy. I tried a few different configurations and nothing really worked, but I wouldn’t say it’s an indictment of the technique so much as it just didn’t happen to do well within the narrow scope that I attempted to apply it. Nevertheless, I’m definitely adding it to my toolbox for future reference.

If my earlier description wasn’t clear, this is how the learning rate is supposed to evolve over time. It forms a triangle from the starting point, coming back to the original learning rate towards the end and then decaying further as training wraps up.

plot_learning_rate(lr_manager)
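If you'd rather see the shape in code than in a plot, here's a minimal sketch of a one-cycle style schedule. To be clear, this is not the implementation inside the callback used above. The function name and the start_divisor and final_divisor parameters are hypothetical knobs I picked for illustration, and the real callback may choose its bounds differently.

```python
def one_cycle_lr(iteration, total_iterations, max_lr, end_percentage=0.1,
                 start_divisor=10.0, final_divisor=100.0):
    """Learning rate at a given iteration under a simple one-cycle policy.

    start_divisor and final_divisor are hypothetical parameters chosen for
    this sketch, not settings taken from any particular library.
    """
    initial_lr = max_lr / start_divisor
    cycle_iterations = int(round(total_iterations * (1 - end_percentage)))
    half_cycle = cycle_iterations / 2.0

    if iteration <= half_cycle:
        # First half of the triangle: ramp up linearly to max_lr
        progress = iteration / half_cycle
        return initial_lr + progress * (max_lr - initial_lr)
    if iteration <= cycle_iterations:
        # Second half of the triangle: ramp back down to initial_lr
        progress = (iteration - half_cycle) / half_cycle
        return max_lr - progress * (max_lr - initial_lr)

    # Final phase: decay below the starting rate as training wraps up
    tail = total_iterations - cycle_iterations
    progress = (iteration - cycle_iterations) / tail
    final_lr = initial_lr / final_divisor
    return initial_lr - progress * (initial_lr - final_lr)

total = 1000
schedule = [one_cycle_lr(i, total, max_lr=0.01) for i in range(total + 1)]
print(schedule[0], max(schedule), schedule[900], schedule[-1])
```

With these settings the rate starts at a tenth of the maximum, peaks halfway through the cycle, returns to the starting value at 90% of training and then decays below it for the last 10%.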

One last trick worth discussing is what we can do with the embeddings that the network learned. Similar to word embeddings, these vectors contain potentially interesting information about how the values in each category relate to each other. One really simple way to see this visually is to do a PCA transform on the learned embedding weights and plot the first two dimensions. Let’s create a function to do just that.

def plot_embedding(model, encoders, category):
    embedding_layer = model.get_layer(category)
    weights = embedding_layer.get_weights()[0]
    pca = PCA(n_components=2)
    weights = pca.fit_transform(weights)
    weights_t = weights.T
    fig, ax = plt.subplots(figsize=(8, 8 * 3 / 4))
    ax.scatter(weights_t[0], weights_t[1])
    for i, label in enumerate(encoders[category].classes_):
        ax.annotate(label, (weights_t[0, i], weights_t[1, i]))
    fig.tight_layout()

We can now plot any categorical variable in the model and get a sense of which categories are more or less similar to each other. For instance, if we examine "day of week", it seems to have picked up that Sunday (7 on the chart) is quite different than every other day for store sales. And if we look at "state" (this data is for a German company BTW) there’s probably some regional similarity to the cluster in the bottom left. It’s a really cool technique that potentially has a wide range of uses.

plot_embedding(model, encoders, 'DayOfWeek')

plot_embedding(model, encoders, 'State')

That pretty much wraps things up! Using deep learning to model structured data is really under-discussed but seems likely to have huge potential since so many data scientists spend so much of their time working with data that looks like this. The use of embeddings in particular feels like a bit of a game-changer on high-cardinality categories. Hopefully breaking it down step by step will help a few of you out there figure out how to adapt deep learning to your problem domain.

The Lean Startup

John Wittenauer — Tue, 10 Jul 2018 00:20:20 GMT

This post is about Eric Ries's book "The Lean Startup". This book's message has a wider applicability than one might think, because contrary to what the title suggests, its methodology applies to more than just startups. There's a telling phrase that Eric uses to define a startup, which is “a human institution designed to create a new product or service under conditions of extreme uncertainty”. I love this definition because it says nothing about what the institution looks like or how big it is. In fact, lean startup principles very often get applied to small "innovation teams" within large, otherwise slow-moving and highly bureaucratic companies. Ries talks about this in the book, stating that there are basically 3 things needed for innovation to thrive - scarce but secure resources, independent authority to develop the business, and a personal stake in the outcome. Teams with these characteristics can deliver surprisingly powerful results.

There are a couple key ideas to the lean startup approach that are worth discussing. The first is a concept Ries calls "validated learning". I think of validated learning as applying the scientific method to building a business. The idea is to systematically test your assumptions by treating everything as an experiment. Specifically, you should set up experiments for every assumption or decision in such a way that there is an unambiguous measurable component (defined a priori) that tells you if you were right or not. Every product, feature, marketing campaign etc. becomes an experiment. Done correctly, these experiments result in empirical demonstrations that valuable truths about the business’s prospects have been discovered.

This idea seems to mesh pretty well with what Nassim Taleb calls "tinkering", which is experimentation without necessarily having a clear thesis about what you hope to find. Tinkering is an asymmetric process because the upside of discovering something can be many orders of magnitude greater than the cost of doing the experiment. In Ries's model the experimentation is more strongly guided by prior assumptions, but the asymmetry still holds. I also think adopting this philosophy would lead to making predictions that are more easily testable (almost out of necessity), which is a good way to calibrate prior beliefs on a wide range of topics.

A related thread in the book is the idea of only doing work that's necessary to achieve validated learning. Anything that doesn't contribute to learning is a form of waste, so just do the stuff that's absolutely necessary to test your assumptions. I imagine this is harder to put into practice than it sounds. It's not at all trivial to identify what's really adding value when you're trying to make decisions day-to-day about what to focus on. There are probably a lot of gray areas and I doubt anyone literally tracks 100% of their time and effort according to this metric, but it seems like a good heuristic.

Another key idea from the book is the concept of value vs. growth hypotheses. It goes like this - the two most important assumptions to make for a product are its value hypothesis (does it deliver value to customers once they're using it) and its growth hypothesis (how will new customers discover it). Every assumption falls into one of these categories. The value hypothesis must be proved before the growth hypothesis. If you've adopted the "validated learning" philosophy, this means that experiments testing the value hypothesis are essentially an early version of your product.

I like this distinction because it has a focusing effect when deciding what to work on. There's no point worrying about how you're going to grow if you don't have a product that anyone wants to use. This directly ties into two of the most famous (infamous?) concepts from the book, the minimum viable product (MVP) and "pivoting". An MVP is the smallest set of features that can be put together to test the value hypothesis. It's a bare-bones version of the product you hope to build. If the MVP doesn't work (i.e. if customers do not find the value hypothesis compelling) then it's time to pivot. Pivoting is just testing a new value hypothesis. It's adjusting your business assumptions in the face of new evidence gained from testing the MVP.

Both of these ideas have permeated popular culture (or at least startup culture) to the point where they're way overused. I think Ries's original intent with MVPs and pivoting makes a lot of sense, but there are some valid criticisms. The biggest one is just how subjective all of it can be. It's not like you get a simple binary pass/fail. The initial product might be kind-of sort-of working, but not quite working well enough, but maybe it will with a few more tweaks or by adding features X and Y. It's very hard to know where you're really at, and no amount of measuring and experimentation can eliminate that ambiguity. Still, as a framework for getting to a viable business model it's a very logical approach.

Overall, I thought this was a great read. Ries's framework for product development turns out to be surprisingly flexible. Even though the focus of the book is on building new businesses, I think the concepts are general enough that they can be applied to a much wider set of circumstances. In a sense, Ries is just broadening the definition of what it means to be innovative (a topic I've written about before) and defining a strategy to consistently achieve innovative results. The book covers a lot more ground than what I discussed here, but it's very accessible and easy to get through in a few days. Highly recommend it.

A Sampling Of Monte Carlo Methods

John Wittenauer — Mon, 16 Apr 2018 00:18:06 GMT

Learning data science is a process of exploration. It involves continually expanding the surface area of concepts and techniques that you have at your disposal by learning new topics that build on or share a knowledge base with the topics you've already mastered. To visualize this, one can imagine a vast network of interconnected nodes. Some nodes sit toward the outside of the graph and have a lot of edges directed toward them - these are topics that require understanding lots of other related concepts before they can be learned. Then there are nodes on the interior of the graph with lots and lots of connections that lead everywhere. These are the really foundational concepts that open the doors to all sorts of new discoveries.

My own experience has been that some of these really important topics can slip through the cracks for a surprisingly long time, especially if you're mostly self-taught like I am. Going back to basics can seem like a waste of time when all of the mainstream focus and attention is on sexy new stuff like self-driving cars, but I've found that understanding the basics at a deep level really compounds your knowledge returns over time.

In a recent post I did an introduction to one such "foundational topic" called Markov Chains. Today we'll explore a related but perhaps even more basic concept - Monte Carlo methods.

Monte Carlo methods are a class of techniques that use random sampling to simulate a draw from some distribution. By making repeated draws and calculating an aggregate on the distribution of those draws, it's possible to approximate a solution to a problem that may be very hard to calculate directly.

If that sounds overly esoteric, don't worry; we're going to step through some examples that will really help crystallize the above statement. These examples are intentionally basic. They're designed to illustrate the core concept without getting lost in problem-specific details. Consider these a starting point for learning how to apply Monte Carlo more broadly.

One key point that's worth stating - Monte Carlo methods are an approach, not an algorithm. This was confusing to me at first. I kept looking for a "Monte Carlo" python library that implemented everything for me like scikit-learn does. There isn't one. It's a way of thinking about a problem, similar to dynamic programming. Each problem is different. There may be some patterns but they have to be learned over time. It isn't something that can be abstracted into a library.

The application of Monte Carlo methods tends to follow a pattern. There are four general steps, and you'll see below that the problems we tackle pretty much adhere to this formula.

1. Create a model of the domain
2. Generate random draws from the distribution over the domain
3. Perform some deterministic calculation on the output
4. Aggregate the results

This sequence informs us about the type of problems where the general application of Monte Carlo methods is useful. Specifically, when we have some generative model of a domain (i.e. something that we can use to generate data points from at will) and want to ask a question about that domain that isn't easily answered analytically, we can use Monte Carlo to get the answer instead.
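As a rough illustration, the four steps can be written down as a tiny generic skeleton. The helper names here (monte_carlo, mean) are my own, and the helper is more of a mnemonic than a reusable library, since the interesting work is always in the problem-specific pieces you pass in. The toy question is the average value of the square of a uniform random number, which has a known exact answer of 1/3 that we can compare against.

```python
import random

def monte_carlo(draw, calculate, aggregate, n_samples):
    """Generic skeleton: draw a sample, calculate on it, aggregate the results."""
    results = [calculate(draw()) for _ in range(n_samples)]
    return aggregate(results)

def mean(values):
    return sum(values) / len(values)

random.seed(0)  # fixed seed only so the sketch is repeatable

# 1. Model of the domain: a number drawn uniformly from [0, 1)
# 2. Random draws: random.random
# 3. Deterministic calculation: square the draw
# 4. Aggregate: average the results
estimate = monte_carlo(random.random, lambda u: u ** 2, mean, 100000)
print(estimate)  # should land close to the exact answer, 1/3
```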

To start off, let's tackle one of the simplest domains there is - rolling a pair of dice. This is very straightforward to implement.

%matplotlib inline
import random

def roll_die():
    return random.randint(1, 6)

def roll_dice():
    return roll_die() + roll_die()

print(roll_dice())
print(roll_dice())
print(roll_dice())

9
8
9

Think of the dice as a probability distribution. On any given roll, there's some likelihood of getting each possible number. Collectively, these probabilities represent the distribution for the dice-rolling domain. Now imagine you want to know what this distribution looks like, having only the knowledge that you have two dice and each one can roll a 1-6 with equal probability. How would you calculate this distribution analytically? It's not obvious, even for the simplest of domains. Fortunately there's an easy way to figure it out - just roll the dice over and over, and count how many times you get each combination!

import matplotlib.pyplot as plt

def roll_histogram(samples):
    rolls = []
    for _ in range(samples):
        rolls.append(roll_dice())

    fig, ax = plt.subplots(figsize=(12, 9))
    plt.hist(rolls, bins=11)

roll_histogram(100000)

The histogram gives us a visual sense of the likelihood of each roll, but what if we want something more targeted? Say, for example, that we wanted to know the probability of rolling a 6 or higher? Again, consider how you would solve this with an equation. It's not easy, right? But with a few very simple lines of code we can write a function that makes this question trivial.

def prob_of_roll_greater_than_equal_to(x, n_samples):
    geq = 0
    for _ in range(n_samples):
        if roll_dice() >= x:
            geq += 1

    probability = float(geq) / n_samples
    print('Probability of rolling greater than or equal to {0}: {1} ({2} samples)'.format(x, probability, n_samples))

All we're doing is running a loop some number of times and rolling the dice, then recording if the result is greater than or equal to some number of interest. At the end we calculate the proportion of samples that matched our criteria, and we have the probability we're interested in. Easy!

You might notice that there's a parameter for the number of samples to draw. This is one of the tricky parts of Monte Carlo. We're relying on the law of large numbers to get an accurate result, but how large is large enough? In practice it seems you just have to tinker with the number of samples and see where the result begins to stabilize (think of it as a hyper-parameter that can be tuned).

To make this more concrete, let's try calculating the probability of a 6 or higher with varying numbers of samples.

prob_of_roll_greater_than_equal_to(6, n_samples=10)
prob_of_roll_greater_than_equal_to(6, n_samples=100)
prob_of_roll_greater_than_equal_to(6, n_samples=1000)
prob_of_roll_greater_than_equal_to(6, n_samples=10000)
prob_of_roll_greater_than_equal_to(6, n_samples=100000)
prob_of_roll_greater_than_equal_to(6, n_samples=1000000)

Probability of rolling greater than or equal to 6: 0.9 (10 samples)
Probability of rolling greater than or equal to 6: 0.68 (100 samples)
Probability of rolling greater than or equal to 6: 0.723 (1000 samples)
Probability of rolling greater than or equal to 6: 0.7217 (10000 samples)
Probability of rolling greater than or equal to 6: 0.72135 (100000 samples)
Probability of rolling greater than or equal to 6: 0.722335 (1000000 samples)

In this case 100 samples wasn't quite enough, but 1,000,000 was probably overkill. This is going to vary depending on the problem though.
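For a domain this small we can sanity-check the simulation by brute force, since two dice only have 36 equally likely outcomes. The sketch below enumerates them to get the exact probability and then uses the standard error of a proportion, sqrt(p(1 - p) / n), to give a rough sense of how much noise to expect at each sample size. This is a side calculation of mine rather than part of the Monte Carlo approach itself, and for harder problems the exact value usually isn't available.

```python
import math
from itertools import product

# Enumerate all 36 equally likely outcomes of two six-sided dice
outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
exact = sum(1 for total in outcomes if total >= 6) / len(outcomes)
print('Exact probability: {0:.4f}'.format(exact))

# Standard error of a proportion estimated from n independent samples
for n in [10, 100, 1000, 10000, 100000, 1000000]:
    std_error = math.sqrt(exact * (1 - exact) / n)
    print('{0} samples: standard error ~ {1:.4f}'.format(n, std_error))
```

The exact answer is 26/36, or about 0.7222, which is where the larger runs above settle. The error shrinks with the square root of the sample count, so each extra decimal place of precision costs roughly 100 times more samples.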

Let's move on to something slightly more complicated - calculating the value of 𝜋. If you're not aware, 𝜋 is the ratio of a circle's circumference to its diameter. In other words, if you "unrolled" a circle with a diameter of one you would get a line with a length of 𝜋. There are analytical ways to derive the value of 𝜋, but what if we didn't know that? What if all we knew was the definition above? Monte Carlo to the rescue!

To understand the function below, imagine a circle inscribed in a square, with the circle's diameter equal to the side of the square. The circle covers 𝜋/4 of the square's area, so if we generate a bunch of points randomly in the square and record how many of them "hit" inside the circle, the ratio of "hits" to total samples should be approximately 𝜋/4. We then multiply by 4 to get an approximation of 𝜋. This works with a full circle as well as a quarter circle (which we'll use below), since a quarter of the unit circle sitting inside the unit square also has an area of 𝜋/4.

import math

def estimate_pi(samples):
    hits = 0
    for _ in range(samples):
        x = random.random()
        y = random.random()

        if math.sqrt((x ** 2) + (y ** 2)) < 1:
            hits += 1

    ratio = (float(hits) / samples) * 4
    print('Estimate with {0} samples: {1}'.format(samples, ratio))

Let's try it out with varying numbers of samples and see what happens.

estimate_pi(samples=10)
estimate_pi(samples=100)
estimate_pi(samples=1000)
estimate_pi(samples=10000)
estimate_pi(samples=100000)
estimate_pi(samples=1000000)

Estimate with 10 samples: 3.2
Estimate with 100 samples: 3.12
Estimate with 1000 samples: 3.172
Estimate with 10000 samples: 3.1352
Estimate with 100000 samples: 3.14964
Estimate with 1000000 samples: 3.14116

We should observe that as we increase the number of samples, the result is converging on the value of 𝜋. If the logic I described above for how we're getting this result isn't clear, a picture might help.

def plot_pi_estimate(samples):
    hits = 0
    x_inside = []
    y_inside = []
    x_outside = []
    y_outside = []

    for _ in range(samples):
        x = random.random()
        y = random.random()

        if math.sqrt((x ** 2) + (y ** 2)) < 1:
            hits += 1
            x_inside.append(x)
            y_inside.append(y)
        else:
            x_outside.append(x)
            y_outside.append(y)

    fig, ax = plt.subplots(figsize=(12, 9))
    ax.set_aspect('equal')
    ax.scatter(x_inside, y_inside, s=20, c='b')
    ax.scatter(x_outside, y_outside, s=20, c='r')
    fig.show()

    ratio = (float(hits) / samples) * 4
    print('Estimate with {0} samples: {1}'.format(samples, ratio))

This function will plot randomly-generated points with a color indicating if the point falls inside (blue) or outside (red) the area of the unit circle. Let's try it with a moderate number of samples first and see what it looks like.

plot_pi_estimate(samples=10000)

Estimate with 10000 samples: 3.1244

We can more or less see the contours of the circle forming. It should look much clearer if we raise the sample count a bit.

plot_pi_estimate(samples=100000)

Estimate with 100000 samples: 3.1368

Better! It's worth taking a moment to consider what we're doing here. After all, approximating 𝜋 (at least to a few decimal places) is a fairly trivial problem. What's interesting about this technique though is we didn't need to know anything other than basic geometry to get there. This concept generalizes to much harder problems where no other method of calculating an answer is known to exist (or where doing so would be computationally intractable). If sacrificing precision is an acceptable trade-off, then using Monte Carlo techniques as a general problem-solving framework in domains involving randomness and uncertainty makes a lot of sense.
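One practical note: since precision is bought with sample count, the pure-Python loop becomes the bottleneck fairly quickly. Here's a sketch of the same quarter-circle estimate vectorized with NumPy. The function name and the seed parameter are my own additions, and it assumes a NumPy version recent enough to provide np.random.default_rng.

```python
import numpy as np

def estimate_pi_vectorized(samples, seed=None):
    """Quarter-circle estimate of pi using arrays instead of a Python loop."""
    rng = np.random.default_rng(seed)
    x = rng.random(samples)
    y = rng.random(samples)
    # Comparing squared distances avoids taking a square root for every point
    hits = np.count_nonzero(x ** 2 + y ** 2 < 1)
    return 4 * hits / samples

print(estimate_pi_vectorized(1000000, seed=0))
```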

A related use of this technique involves combining Monte Carlo methods with Markov Chains, and is called (appropriately) Markov Chain Monte Carlo (usually abbreviated MCMC). A full explanation of MCMC is well outside of our scope, but I encourage the reader to check out this notebook for more information (side note: it's part of a whole series on Bayesian methods that is really good, and well worth your time). In the interest of not adding required reading to understand the next part, I'll try to briefly summarize the idea behind MCMC.

Like general Monte Carlo methods, MCMC is fundamentally about sampling from a distribution. But unlike before, MCMC is an approach to sampling an unknown distribution, given only some existing samples. MCMC involves using a Markov chain to "search" the space of possible distributions in a guided way. Rather than generating truly random samples, it uses the existing data as a starting point and then "walks" a Markov chain toward a state where the chain (hopefully) converges with the real posterior distribution (i.e. the same distribution that the original sample data came from).

In a sense, MCMC inverts what we saw above. In the dice example, we began with a distribution and drew samples to answer some question about that distribution. With MCMC, we begin with samples from some unknown distribution, and our objective is to approximate, as best we can, the distribution that those samples came from. This way of thinking about it helps to clarify in what situations we need general Monte Carlo methods vs. MCMC. If you already have the "source" distribution and need to answer some question about it, it's a Monte Carlo problem. However, if all you have is some data but you don't know the "source", then MCMC can help you find it.

Let's see an example to make this more concrete. Imagine we have the result of a series of coin flips and we want to know if the coin being used is unbiased (that is, equally likely to land on heads or tails). How would you determine this from the data alone? Let's generate a sequence of coin flips from a coin that we know to be biased so we have some data as a starting point.

def biased_coin_flip():
    if random.random() <= 0.6:
        return 1
    else:
        return 0

n_trials = 100
coin_flips = [biased_coin_flip() for _ in range(n_trials)]
n_heads = sum(coin_flips)
print(n_heads)

In this case since we're producing the data ourselves we know it is biased, but imagine we didn't know where this data came from. All we know is we have 100 coin flips and 60 are heads. Obviously 60 is greater than 50, and 50/100 is what we would guess if the coin was fair. On the other hand, it's definitely possible to get 60/100 heads with a fair coin just due to randomness. How do we move from a point estimate to a distribution of the likelihood that the coin is fair? That's where MCMC comes in.

import pymc3 as pm

with pm.Model() as coin_model:
    p = pm.Uniform('p', lower=0, upper=1)
    obs = pm.Bernoulli('obs', p, observed=coin_flips)
    step = pm.Metropolis()
    trace = pm.sample(100000, step=step)
    trace = trace[5000:]

Understanding this code requires some background in Bayesian statistics as well as PyMC3. Very simply, we define a prior distribution (p) along with an observed variable (obs) representing our known data. We then configure which algorithm we want to use (Metropolis-Hastings in this case) and initiate the chain. The result is a sequence of values that should, in aggregate, represent the most likely distribution that characterizes the original data.

To see what we ended up with, we can plot the values in a histogram.

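p_heads = 0.6  # the bias built into biased_coin_flip above

fig, ax = plt.subplots(figsize=(12, 9))
plt.title('Posterior distribution of $p$')
plt.vlines(p_heads, 0, n_trials / 10, linestyle='--', label='true $p$ (unknown)')
# 'normed' was removed from newer matplotlib releases; 'density' replaces it
plt.hist(trace['p'], range=[0.3, 0.9], bins=25, histtype='stepfilled', density=True)
plt.legend()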

From this result, we can see that the overwhelming likelihood is that the coin is biased (if it was fair then we would expect the "bulk" of the distribution to be around 0.5). To actually derive a concrete probability estimate though, we need to specify a range for which we would consider the result "fair" and integrate over the probability density function (basically the histogram above). For the sake of argument, let's say that anything between .45-.55 is fair. We can then compute the result using a simple count.

import numpy as np

n_fair = len(np.where((trace['p'] >= 0.45) & (trace['p'] < 0.55))[0])
n_total = len(trace['p'])

print(float(n_fair / n_total))

0.16254736842105263

By our definition of "fair" above, there's roughly a 16% chance that the coin is unbiased.
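If the sampler feels like magic, it may help to see a hand-rolled version. Below is a minimal sketch of a random-walk Metropolis sampler for this same coin problem, written in plain Python. It is not what PyMC3 does internally (PyMC3 tunes its proposals, among other things), the function names and step size are my own choices, and it assumes exactly 60 heads in 100 flips as in the discussion above.

```python
import math
import random

def log_posterior(p, n_heads, n_trials):
    """Log of the unnormalized posterior for p under a Uniform(0, 1) prior."""
    if p <= 0 or p >= 1:
        return float('-inf')  # outside the prior's support
    return n_heads * math.log(p) + (n_trials - n_heads) * math.log(1 - p)

def metropolis(n_heads, n_trials, n_steps, step_size=0.1, start=0.5):
    """Random-walk Metropolis: propose a nearby p, accept or stay put."""
    samples = []
    current = start
    current_lp = log_posterior(current, n_heads, n_trials)
    for _ in range(n_steps):
        proposal = current + random.gauss(0, step_size)
        proposal_lp = log_posterior(proposal, n_heads, n_trials)
        # Accept with probability min(1, posterior(proposal) / posterior(current))
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if random.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

random.seed(0)  # fixed seed only so the sketch is repeatable
# Assumption: 60 heads out of 100 flips; drop the first 5000 steps as burn-in
samples = metropolis(n_heads=60, n_trials=100, n_steps=50000)[5000:]
posterior_mean = sum(samples) / len(samples)
fair_fraction = sum(1 for p in samples if 0.45 <= p < 0.55) / len(samples)
print(posterior_mean, fair_fraction)
```

Each step only needs the ratio of posterior values at two points, which is why the sampler never has to compute the normalizing constant. With a uniform prior the exact posterior here is a Beta(61, 41) distribution, so the sample mean should come out near 61/102.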

Hopefully these examples provide a good illustration of the power and usefulness of Monte Carlo methods. As I mentioned at the top, we're just scratching the surface of this topic (and I'm still learning myself). One of the more satisfying feelings for me intellectually is learning about some new idea or topic and then realizing that it relates and connects to other things I already know about in all sorts of interesting ways. I think Monte Carlo methods fit this definition for me, and probably for most readers as well.

The Cryptocurrency Movement

John Wittenauer — Sat, 24 Feb 2018 17:38:24 GMT

I wrote the following thoughts in response to some questions from the CIO of the company I work for about mining Bitcoin. I decided to post my response (lightly edited) here because I think it summarizes my view on cryptocurrencies (and specifically Bitcoin) pretty well.

Currencies or stores of value require trust – trust that a unit of it will be recognized and accepted by others as a medium of exchange, trust that its supply is limited to prevent arbitrary devaluation, etc. All known forms of currency before 2008 relied on either centralization (fiat currencies) or physical scarcity (gold, commodities) to establish trust.

Bitcoin, and cryptocurrencies more generally, attempt to do something that has never been possible before – how do you create trust in a decentralized, digital system with no top-down control or ownership, in an environment where bits can be copied or manipulated at zero cost?

Decentralization has a lot of benefits, if you can pull it off. Under the right conditions, it enables humans to organize and collaborate in fundamentally new ways. In the long run, it may even disrupt political and social institutions by replacing networks with markets. But to do that, one first needs to solve the trust problem.

Bitcoin’s answer to this problem is “proof of work” – an algorithm for creating distributed trustless consensus. It gets around the double-spend problem (inconsistent ledgers) while also incentivizing validation of transactions on the ledger by using cryptography to require increasingly harder mathematical “puzzles” be solved to confirm a transaction.

Why increasingly harder? To prevent malicious actors from manipulating the ledger. Imposing a scaling cost on adding to the ledger makes it intractable to “re-write” large portions of the ledger. It would require compute power almost as big as the entire network.
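To make the "puzzle" idea concrete, here is a toy sketch of a hash-based proof of work. It is a simplification for illustration, not Bitcoin's actual protocol (which hashes a block header twice with SHA-256 and compares the result against a numeric target). The function names mine and verify and the sample transaction string are made up for this example.

```python
import hashlib

def mine(block_data, difficulty):
    """Find a nonce so that SHA-256(block_data + nonce) starts with `difficulty` zero hex digits."""
    prefix = '0' * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256('{0}{1}'.format(block_data, nonce).encode('utf-8')).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

def verify(block_data, nonce, difficulty):
    """Checking a solution takes a single hash, no matter how long the search took."""
    digest = hashlib.sha256('{0}{1}'.format(block_data, nonce).encode('utf-8')).hexdigest()
    return digest.startswith('0' * difficulty)

# Each extra zero hex digit multiplies the expected search effort by 16
for difficulty in range(1, 5):
    nonce, digest = mine('alice pays bob 1 coin', difficulty)
    print(difficulty, nonce, digest[:12])
```

The asymmetry is the point: finding a nonce takes many attempts on average, while anyone can check it with one hash. Re-writing history would mean redoing that search for every block being replaced.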

In order to incentivize network participants to keep doing these “puzzles”, completing a puzzle rewards a small amount of Bitcoin. The reward itself has no intrinsic value, but belief in the network leads the market to assign it one. As the network expands, the “conventional” value of the reward increases, leading to more mining participation to keep up with the demands of the network in a self-reinforcing feedback loop.

However, there are problems with this model. Proof of work comes with a societal cost via consumption of other scarce resources (electricity). Since fiat money can buy compute power, and thus voting power in the network, it can lead to a de-facto centralization. The blockchain community is well aware of these limitations and a lot of time and effort is being devoted to solving them. Ethereum, for example, is planning to implement a different verification algorithm called proof of stake that theoretically eliminates these downsides. Bitcoin could follow suit eventually, or end up with an entirely different solution.

Could Bitcoin get much bigger than it is today? Yes, absolutely. Bitcoin’s market cap is proportional to the number of believers in the network. And compared to traditional financial markets, it’s tiny. All Bitcoin combined is worth less than $200 billion. By comparison, the worldwide value of gold is ~$8 trillion. Equity markets are $100 trillion. Currency markets are bigger still. There are lots of good reasons why Bitcoin probably won’t ever get that big, but it might.

Given its potential size, does it make sense to try your hand at mining? Probably not. From an economics standpoint, the market is highly efficient. Participation has been commoditized thanks to easy access to specialized mining hardware. No barrier to entry, hence no moat. If the goal is simply understanding rather than financial gain, there’s nothing one could learn from mining that couldn’t be learned independently. Everything there is to know about how this stuff works is freely available online. The code is even open-source.

Many in the tech world view cryptonetworks today as analogous to the early stages of the internet. The implication, of course, is that the technology will be every bit as impactful as the internet has been, but it may take a while to see it materialize. They may well be right, but it's important to emphasize that it's still really early in the cycle. The internet went through a funding nuclear winter before it took off, and the same could still happen to crypto. The possibilities are exciting for sure, but personally I'm trying to temper short-term hype or extrapolations of value and take a long-term view.

Markov Chains From Scratch

John Wittenauer — Tue, 16 Jan 2018 02:12:08 GMT

Sometimes it pays to go back to basics. Data science is a massive, complicated field with seemingly endless topics to learn about. But in our rush to learn about the latest deep learning trends, it's easy to forget that there are simple yet powerful techniques right under our noses. In this post we'll explore one such technique called a Markov chain. By building one from scratch using nothing but standard Python libraries, we'll see how simplistic they can be while also yielding some cool results.

Markov chains are essentially a way to capture the probability of state transitions in a system. A process can be considered a Markov process if one can make predictions about the future state of the process based solely on its present state (or several of the most recent states for a higher-order Markov process). In other words, the history doesn't matter beyond a certain point. There are lots of great explainers out there so I'll leave that for the reader to explore independently (this one is my favorite). It will become clearer as we step through the code, so let's dive in.

For this example we're going to build a language-based Markov chain. More specifically, we'll read in a corpus of text and identify pairs of words that appear together. The pairings are sequential such that when a word $w_1$ is followed by a word $w_2$, then we say that the system has a probabilistic state transition from $w_1$ to $w_2$. An example will help. Consider the phrase "the brown fox jumped over the lazy dog". If we break this down by word pairings, our state transitions would look like this:

the: [brown, lazy]
brown: [fox]
fox: [jumped]
jumped: [over]
over: [the]
lazy: [dog]

This set of state transitions is called a Markov chain. With this in hand we can now choose a starting point (i.e. a word in the corpus) and "walk the chain" to create a new phrase. Markov chains built in this manner over large amounts of text can produce surprisingly realistic-sounding phrases.
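As a quick sketch, the transitions for the example phrase can be built and walked in a few lines. The names toy_chain and walk are just for this illustration, and the cap of 10 words is an arbitrary choice to keep the loop from cycling forever.

```python
import random
from collections import defaultdict

phrase = 'the brown fox jumped over the lazy dog'
tokens = phrase.split(' ')

# Map each word to the list of words that follow it in the phrase
toy_chain = defaultdict(list)
for current_word, next_word in zip(tokens, tokens[1:]):
    toy_chain[current_word].append(next_word)

print(dict(toy_chain))

# Walk the chain from a starting word until we hit a word with no successors
# (or reach the arbitrary 10-word cap)
random.seed(0)
word = 'the'
walk = [word]
while word in toy_chain and len(walk) < 10:
    word = random.choice(toy_chain[word])
    walk.append(word)

print(' '.join(walk))
```

Note that "dog" never appears as a key because nothing follows it, so any walk that reaches it stops there.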

In order to get started we need a corpus of text. Anything sufficiently large will do, but to really have some fun (and at the risk of bringing politics into the mix) we're going to make Markov chains great again by using this collection of text from Donald Trump's campaign speeches. Our first step is to import the text file and parse it into words.

from urllib.request import urlopen

# Download the raw speech corpus (requires network access)
text = urlopen('https://raw.githubusercontent.com/ryanmcdermott/trump-speeches/master/speeches.txt')
words = []
for line in text:
    # Drop the byte-order mark and any non-ASCII characters, then work with a plain string
    line = line.decode('utf-8-sig', errors='ignore')
    line = line.encode('ascii', errors='ignore').decode('ascii')
    line = line.replace('\r', ' ').replace('\n', ' ')
    # Split on spaces and discard empty tokens
    words.extend(word for word in line.split(' ') if word != '')

print('Corpus size: {0} words.'.format(len(words)))

Corpus size: 166259 words.

I did some clean-up by converting it to ASCII and removing line breaks, but that's about it; the rest of the text is left as it appears in the source file. Our next step is to build the transition probabilities. We'll represent our transitions as a dictionary where the keys are the distinct words in the corpus and the value for a given key is a list of words that appear after that key. To build the chain we just need to iterate through the list of words, add each word to the dictionary if it's not already there, and add the word following it to the list of transition words.

chain = {}
n_words = len(words)
for i, key in enumerate(words):
    if n_words > (i + 1):
        word = words[i + 1]
        if key not in chain:
            chain[key] = [word]
        else:
            chain[key].append(word)

print('Chain size: {0} distinct words.'.format(len(chain)))

Chain size: 13292 distinct words.

It may come as a surprise that we're just naively inserting words into the transition list without caring if that word had appeared already or not. Won't we get duplicates, and isn't that a problem? Yes we will, and no it's not. Think of this as a simplistic way of representing the transition probability. If a word appears multiple times in the list, and we sample from the list randomly during a transition, there's a higher likelihood that we pick that word proportional to the number of times it appeared after the key relative to all the other words in the corpus that appeared after that key.
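To see why the duplicates amount to transition probabilities, here is a small sketch that converts a follower list into explicit probabilities. The list below is made up for illustration; in practice it would be the list stored in the chain for a given key.

```python
from collections import Counter

def transition_probabilities(followers):
    """Convert a list of follower words (with repeats) into explicit probabilities."""
    counts = Counter(followers)
    total = len(followers)
    return {word: count / total for word, count in counts.items()}

# Hypothetical follower list for illustration
followers = ['great', 'great', 'easy.', 'sad.']
probs = transition_probabilities(followers)
print(probs)
```

Sampling uniformly from the raw list picks "great" half the time, which is the same probability the explicit table assigns to it. The list-with-duplicates representation just skips the bookkeeping.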

Now that we've built our Markov chain, we can get to the fun part - using it to generate phrases! To do this we only need two pieces of information - a starting word, and a phrase length. We're going to randomly select a starting word from the corpus and make our phrases tweet-length by sampling until our phrase hits 140 characters (assume we're part of the #never280 crowd). Let's give it a try.

import random
w1 = random.choice(words)
tweet = w1

while len(tweet) < 140:
    w2 = random.choice(chain[w1])
    tweet += ' ' + w2
    w1 = w2

print(tweet)

Were not going to run by the 93 million people are, where were starting. New Hampshire." I PROMISE. I do so incredible, and be insulted, Chuck.

Not bad! The limitations of using only one word for context are readily apparent though. We can improve it by using a 2nd-order Markov chain instead. This time, instead of using simple word pairings, our "keys" will be the distinct pairs of consecutive words that appear in the text, and each key maps to the words that follow that pair. Borrowing from the example phrase earlier, a 2nd-order Markov chain for "the brown fox jumped over the lazy dog" would look like:

(the, brown): [fox]
(brown, fox): [jumped]
(fox, jumped): [over]
(jumped, over): [the]
(over, the): [lazy]
(the, lazy): [dog]

In order to build a 2nd-order chain, we have to make a few modifications to the code.

chain = {}
n_words = len(words)
for i, key1 in enumerate(words):
    if n_words > i + 2:
        key2 = words[i + 1]
        word = words[i + 2]
        if (key1, key2) not in chain:
            chain[(key1, key2)] = [word]
        else:
            chain[(key1, key2)].append(word)

print('Chain size: {0} distinct word pairs.'.format(len(chain)))

Chain size: 72373 distinct word pairs.

We can do a sanity check to make sure it's doing what we expect by choosing a word pair that appears somewhere in the text and then examining the transitions in the chain for that pair of words.

chain[("Its", "so")]

['great',
 'great',
 'easy.',
 'preposterous.',
 'important...',
 'simple.',
 'simple.',
 'horrible.',
 'out',
 'terrible.',
 'sad.',
 'much',
 'can',
 'easy.',
 'embarrassing',
 'astronomical']

Looks about like what I'd expect. Next we need to modify the "tweet" code to handle the new design.

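# randint is inclusive, so stop two short of the end: the starting pair
# then always has a following word and is guaranteed to be a key in the chain
r = random.randint(0, len(words) - 3)
key = (words[r], words[r + 1])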
tweet = key[0] + ' ' + key[1]

while len(tweet) < 140:
    w = random.choice(chain[key])
    tweet += ' ' + w
    key = (key[1], w)

print(tweet)

there. They saw it. He talks about medical cards. He talks about fixing the VA health care. They want to talk to me from Georgia? "Dear So and

Better! Let's turn this into a function that we can call repeatedly to see a few more examples.

def markov_tweet(chain, words):
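    # randint is inclusive, so stop two short of the end: the starting pair
    # then always has a following word and is guaranteed to be a key in the chain
    r = random.randint(0, len(words) - 3)
    key = (words[r], words[r + 1])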
    tweet = key[0] + ' ' + key[1]

    while len(tweet) < 140:
        w = random.choice(chain[key])
        tweet += ' ' + w
        key = (key[1], w)

    print(tweet + '\n')

markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)
markov_tweet(chain, words)

East. But we have a huge subject. Ive been with the Romney campaign. Guys made tens of thousands of people didnt care about the vets in one hour.

somebody is going to put American-produced steel back into the sky. It will be the candidate. But I think 11 is a huge problem. And Im on the

THAT WE CAN ONLY DREAM ABOUT. THEY HAVE A VERY BIG BEAUTIFUL GATE IN THAT WALL, BIG AND BEAUTIFUL, RIGHT. NO. NO, I DON'T KNOW WHERE THEY HAVE

We need to get so sick of me. I didnt want the world my tenant. They buy condos for tens of millions of dollars overseas. And too many executive

Wont be as good as you know, started going around and were going to win. Were going to happen. Thank you. SPEECH 8 This is serious rifle. This

That's all there is to it! Incredibly simple yet surprisingly effective. It's obviously not perfect but it's not complete gibberish either. If you run it enough times you'll find some combinations that actually sound pretty plausible. These results could probably be improved significantly with a much more powerful technique like a recurrent neural net, but relative to the effort involved it's hard to beat Markov chains.

Sapiens

John Wittenauer — Fri, 03 Nov 2017 01:43:08 GMT

This post is about Yuval Noah Harari's book Sapiens. All of my favorite books have changed the way I view the world in some non-trivial way. There's nothing quite like the experience of reading a book and then, afterward, realizing that you can't go back to who you were before you read it because you just see things differently. Sapiens falls into that category for me. At first glance, one would expect that this is a book about the inexorable rise of humankind - where we came from, how we evolved to what we are today, and what we were like along the way. It is that, at least in a sense, but it's also so much more. Sapiens goes beyond a simple stating of the facts as we know them and uncovers some deeper (and often inconvenient) truths about humankind. Harari's analysis of WHY history unfolded the way it did is more fascinating than the history itself, and paints a much more holistic picture of what it means to be human.

One of the most interesting themes throughout the book is the role that imagination has played in our species' ascendancy on the world stage. I think most people probably believe that humans have always been at the top of the food chain, but for most of our history we were essentially foragers (or as Harari put it, "an animal of no significance"). It wasn't until the "cognitive revolution" around 70,000 years ago that humans moved to the top of the food chain. The reason seems to be that our new cognitive skills allowed us to coordinate with other humans on a level that the world had not seen before. What specifically enabled this were two things - new language skills, and the ability to invent fiction. Our newfound ability to imagine things that aren't real allowed humans to organize around shared beliefs in myths that we, ourselves, created.

If you think about it, transitioning from a world in which all life operates via basic biological principles to one in which a species can imagine alternate realities is a big deal. This change arguably led to everything that has come since, to all of the accumulated knowledge that our civilization has amassed in the many generations that followed. Before this transition, an organism's operating system was programmed in its DNA. Some organisms could acquire limited amounts of knowledge from their environment, such as the best places to hunt or tricks to avoid being eaten, but everything else was hard-wired. With the ability to imagine, humans could rewrite their own operating system. We could create an idea - a conception of a thing that didn't exist, and make it reality.

What I find particularly fascinating is Harari's argument that basically every organizing principle in modern human society falls under the category of a shared belief in fiction. Religions, laws, nations, companies, human rights - these are all, in a sense, figments of our collective imaginations. Biology doesn't take a stance on how humans should treat each other. There is no physical necessity for the borders we draw around nations, no tangible thread holding together the assets of a corporation. All of this seems completely obvious upon introspection, but I discovered that until reading this book I had never really thought about it. I've found that the way I view things like political debate has changed as a result.

Sapiens delves into numerous topics that one might be surprised to find in a "history" book. For instance, there's a chapter that questions whether or not we're any happier or more fulfilled as a species for all the progress we've made. In another chapter, Harari looks at the trends of the past and begins to speculate about where we're headed next. He discusses the role that technologies like artificial intelligence and genetic engineering might play in this future, and whether or not we'll even still call ourselves Homo sapiens.

Philosophy aside, Sapiens does a great job of covering the major themes in our species' history. The agricultural revolution, the invention of written language and money, imperialism, the scientific revolution, the rise of capitalism and industry. The book covers a lot of ground. Harari's analysis is often counter-intuitive (and probably controversial), but always thoroughly researched and very well-reasoned. For example, he called the agricultural revolution "history's biggest fraud" because for most humans, the transition led to a much harsher and less rewarding life. He documents the "imagined order" of large societies that placed people into hierarchies, and concludes that the particular structures of these hierarchies are mostly accidents of history. He argues that scientific research flourished only in alliance with ideologies such as capitalism or imperialism, because these ideologies funded the cost of research. These conclusions are not immediately obvious but he makes a compelling case.

There was so much packed into this book that I'm really just scratching the surface. It's not a quick read but the writing style is very engaging so it doesn't feel laborious like you might expect for a book with this subject matter. I would highly recommend it to anyone interested in expanding their knowledge about humankind's social and cultural evolution.

Time Series Forecasting With Prophet

John Wittenauer — Sat, 02 Sep 2017 01:51:51 GMT

Prophet is an open source forecasting tool built by Facebook. It can be used for time series modeling and forecasting trends into the future. Prophet is interesting because it's both sophisticated and quite easy to use, so it's possible to generate very good forecasts with relatively little effort or domain knowledge in time series analysis.

There are a few requirements you'll need to meet in order to use the library. It uses PyStan to do all of its inference, so PyStan has to be installed. PyStan has its own dependencies, including a C++ compiler. Python 3 also appears to be a requirement. Full installation instructions are here.

Let's take a quick tour through Prophet's capabilities. We can start by reading in some sample time series data. In this case we're using Wikipedia page hits for Peyton Manning, which is the data set that Facebook collected for the library's example code.

%matplotlib inline
import os
import pandas as pd
import numpy as np
from fbprophet import Prophet

path = os.path.dirname(os.path.dirname(os.getcwd())) + '/data/manning.csv'
data = pd.read_csv(path)
data['ds'] = pd.to_datetime(data['ds'])
data.head()

	ds	y
0	2007-12-10	14629
1	2007-12-11	5012
2	2007-12-12	3582
3	2007-12-13	3205
4	2007-12-14	2680

There are only two columns in the data, a date and a value. The naming convention of using 'ds' for the date and 'y' for the value is apparently a requirement to use Prophet; it's expecting those exact names and will not work otherwise!
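If your own data uses different column names, a rename with pandas before fitting takes care of it. A minimal sketch, using a made-up data frame (the 'date' and 'page_views' names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw data with arbitrary column names (not from the Manning data set)
raw = pd.DataFrame({
    'date': ['2017-01-01', '2017-01-02', '2017-01-03'],
    'page_views': [120, 95, 143],
})

# Rename to the 'ds' / 'y' names described above and parse the dates
prepared = raw.rename(columns={'date': 'ds', 'page_views': 'y'})
prepared['ds'] = pd.to_datetime(prepared['ds'])

print(prepared.dtypes)
```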

Let's examine the data by plotting it using pandas' built-in plotting function.

data.set_index('ds').plot(figsize=(12, 9))

The data is highly volatile with order-of-magnitude differences between a typical day and a high-traffic day. This will be hard to model directly. Let's try applying a log transform to see if that helps.

data['y'] = np.log(data['y'])
data.set_index('ds').plot(figsize=(12, 9))

Much better! Not only does the series look far more stable, but we've also revealed what looks like some cyclical patterns in the data. We can now instantiate a Prophet model and fit it to our data.

m = Prophet()
m.fit(data)

That was easy! This is one of the most attractive features of Prophet. It essentially does all of the model selection work for you and gives you a result that works well without much user input required. In this case we didn't have to specify anything at all, just give it some data and we get a model.

We'll explore below what the model looks like but it's worth spending a moment first to explain what's going on here. Unlike typical time-series methods like ARIMA (which are considered generative models), Prophet uses something called an additive regression model. This is essentially a sophisticated curve-fitting model. I haven't dug into any of the math, but based on the description in their introductory blog post, Prophet builds separate components for the trend, yearly seasonality, and weekly seasonality in the time series (with holidays as an optional fourth component). We can witness this directly by looking at one of the undocumented properties on the model object that shows the fitted parameters.

m.params

{u'beta': array([[ 0.        , -0.03001147,  0.04819977,  0.00999481, -0.00228437,
          0.01252909,  0.01559136,  0.00950633,  0.00075704,  0.00391209,
         -0.00586589,  0.0075454 , -0.00524287,  0.00208091, -0.00477578,
         -0.00410379, -0.0077744 , -0.00081338,  0.00125811,  0.00187115,
          0.0069828 , -0.01233829, -0.01057246,  0.00938595,  0.00847051,
          0.00088024, -0.00352237]]),
 u'delta': array([[  1.62507395e-07,   1.29092081e-08,   3.48169254e-01,
           4.57815903e-01,   1.61826714e-07,  -5.66144938e-04,
          -2.34969389e-01,  -2.46905754e-01,   9.96595883e-08,
          -1.82605683e-07,   6.12381739e-08,   2.78653912e-01,
           2.30631082e-01,   2.83118248e-03,   1.55276178e-03,
          -8.61134360e-01,  -3.14239669e-07,   5.54456073e-09,
           4.91423429e-07,   4.71475093e-01,   7.93935609e-03,
           1.36547372e-07,  -3.38274613e-01,  -3.20008088e-07,
           1.16410210e-07]]),
 u'gamma': array([[ -5.37486490e-09,  -8.40863029e-10,  -3.59567303e-02,
          -6.19588853e-02,  -2.69802216e-08,   1.12158987e-04,
           5.44799089e-02,   6.53304459e-02,  -2.95648930e-08,
           6.03344459e-08,  -2.21556944e-08,  -1.09561865e-01,
          -9.78411305e-02,  -1.28994139e-03,  -7.57253043e-04,
           4.47568989e-01,   1.73293155e-07,  -3.23167613e-09,
          -3.01853068e-07,  -3.04398195e-01,  -5.37507537e-03,
          -9.67767399e-08,   2.50366597e-01,   2.46999155e-07,
          -9.35053320e-08]]),
 u'k': array([[-0.35578215]]),
 u'm': array([[ 0.62604285]]),
 u'sigma_obs': array([[ 0.03759107]])}

I think the beta, delta, and gamma arrays correspond to the distributions for the three different components. The way I think about this is we're saying we have three different regression models with some unknown set of parameters, and we want to find the combination of those models that best explains the data. We can attempt to do this using maximum a posteriori (MAP) estimation, where our priors are the equations for the regression components (piecewise linear for the trend, Fourier series for the seasonal component, and so on). This appears to be what Prophet is doing. I can't say I've looked at it in any great detail so part of that explanation could be wrong, but I think it's broadly correct.
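To get a feel for the "additive curve-fitting" idea, here is a rough sketch that fits a straight-line trend plus a weekly Fourier series to synthetic data using ordinary least squares. This is not Prophet's implementation (Prophet uses a piecewise linear trend with changepoints and does its fitting in Stan). The data, the fourier_features helper, and the choice of three harmonics are all assumptions for illustration.

```python
import numpy as np

def fourier_features(t, period, n_terms):
    """Build sin/cos columns for the first n_terms harmonics of the given period."""
    columns = []
    for k in range(1, n_terms + 1):
        columns.append(np.sin(2 * np.pi * k * t / period))
        columns.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(columns)

# Synthetic daily series: linear trend + weekly cycle + noise (all values are made up)
rng = np.random.default_rng(42)
t = np.arange(365, dtype=float)
y = 5.0 + 0.01 * t + 0.5 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.1, size=t.size)

# Design matrix: intercept, trend, and weekly Fourier terms
X = np.column_stack([np.ones_like(t), t, fourier_features(t, period=7, n_terms=3)])
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# The fitted curve is simply the sum of the separate components
trend = X[:, :2] @ coef[:2]
weekly = X[:, 2:] @ coef[2:]
y_hat = trend + weekly

print('intercept, slope:', coef[:2])
print('weekly sin/cos coefficients:', coef[2:])
```

Because the model is additive, each component can be inspected on its own, which is the same property that makes the component plots later in this post possible.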

Now that we have a model, let's see what we can do with it. The obvious place to start is to forecast what we think our value will be for some future dates. Prophet makes this easy with a helper function.

future_data = m.make_future_dataframe(periods=365)
future_data.tail()

	ds
3265	2017-01-15
3266	2017-01-16
3267	2017-01-17
3268	2017-01-18
3269	2017-01-19

That gives us a data frame with dates going one year forward from where our data ends. We can then use the "predict" function to populate this data frame with forecast information.

forecast = m.predict(future_data)
forecast.columns

Index([u'ds', u't', u'trend', u'seasonal_lower', u'seasonal_upper',
       u'trend_lower', u'trend_upper', u'yhat_lower', u'yhat_upper', u'weekly',
       u'weekly_lower', u'weekly_upper', u'yearly', u'yearly_lower',
       u'yearly_upper', u'seasonal', u'yhat'],
      dtype='object')

The point estimate forecasts are in the "yhat" column, but note how many columns got added. In addition to the forecast itself we also have point estimates for each of the components, as well as upper and lower bounds for each of these projections. That's a lot of detail provided out-of-the-box just by calling a single function!

Let's see an example.

forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

	ds	yhat	yhat_lower	yhat_upper
3265	2017-01-15	8.200620	7.493151	8.886727
3266	2017-01-16	8.525638	7.791967	9.266697
3267	2017-01-17	8.313019	7.620597	9.000529
3268	2017-01-18	8.145577	7.449701	8.870133
3269	2017-01-19	8.157476	7.467178	8.860933

Prophet also supplies several useful plotting functions. The first one is just called "plot", which displays the actual values along with the estimates. For the forecast period it only displays the projections since we don't have actual values for this period.

m.plot(forecast);

I found this to be a bit confusing because the data frame we passed in only contained the "forecast" date range, so where did the rest of it come from? I think the model object is storing the data it was trained on and using it as part of this function, so it looks like it will plot the whole date range regardless.

We can use another built-in plot to show each of the individual components. This is quite useful to visually inspect what the model is capturing from the data. In this case there are a few clear takeaways such as higher activity during football season or increased activity on Sunday & Monday.

m.plot_components(forecast);

In addition to the above components, Prophet can also incorporate possible effects from holidays. Holidays and dates for each holiday have to be manually specified over the entire range of the data set (including the forecast period). The way holidays get defined and incorporated into the model is fairly simple. Below are some holiday definitions for our current data set that include Peyton Manning's playoff and Superbowl appearances (taken from the example code).

playoffs = pd.DataFrame({
  'holiday': 'playoff',
  'ds': pd.to_datetime(['2008-01-13', '2009-01-03', '2010-01-16',
                        '2010-01-24', '2010-02-07', '2011-01-08',
                        '2013-01-12', '2014-01-12', '2014-01-19',
                        '2014-02-02', '2015-01-11', '2016-01-17',
                        '2016-01-24', '2016-02-07']),
  'lower_window': 0,
  'upper_window': 1,
})

superbowls = pd.DataFrame({
  'holiday': 'superbowl',
  'ds': pd.to_datetime(['2010-02-07', '2014-02-02', '2016-02-07']),
  'lower_window': 0,
  'upper_window': 1,
})

holidays = pd.concat((playoffs, superbowls))

Once we have holidays defined in a data frame, using them in the model is just a matter of passing in the data frame as a parameter when we define the model.

m = Prophet(holidays=holidays)
forecast = m.fit(data).predict(future_data)
m.plot_components(forecast);

Our component plot now includes a holidays component with spikes indicating the magnitude of influence those holidays have on the value.

While the Prophet library itself is very powerful, there are some useful features that we'd typically want when doing time series modeling that it currently doesn't provide. One very simple and obvious thing that's needed is a way to evaluate the forecasts. We can do this ourselves using scikit-learn's metrics (you could also calculate it yourself). Note that since we took the natural log of the series earlier we need to reverse that to get a meaningful number.

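from sklearn.metrics import mean_absolute_error
# Keep the predictions in their own data frame rather than overwriting the original data
predictions = m.predict(data)
# Undo the log transform so the error is in units of page views
mean_absolute_error(np.exp(data['y']), np.exp(predictions['yhat']))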

2436.9620410194648
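For reference, calculating the metric by hand is only a couple of lines with numpy. The sketch below uses made-up numbers rather than the Manning data, and mae_original_scale is a hypothetical helper name.

```python
import numpy as np

def mae_original_scale(log_actual, log_predicted):
    """Mean absolute error after undoing a natural-log transform on both series."""
    actual = np.exp(np.asarray(log_actual, dtype=float))
    predicted = np.exp(np.asarray(log_predicted, dtype=float))
    return np.mean(np.abs(actual - predicted))

# Made-up values for illustration: true counts of 100, 200, 400 and imperfect forecasts
log_actual = np.log([100.0, 200.0, 400.0])
log_predicted = np.log([110.0, 180.0, 400.0])

print(mae_original_scale(log_actual, log_predicted))
```

The absolute errors here are 10, 20, and 0, so the result comes out at about 10 (up to floating-point rounding).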

That works fine as a very simple example, but for real applications we'd probably want something more robust like cross-validation over sliding windows of the data set. Currently in order to accomplish this we'd have to implement it ourselves.
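As a rough idea of what implementing it ourselves might look like, here is a sketch of a generator that produces sliding train/test index windows. The function name and parameters are my own invention for illustration; the window sizes would need tuning for a real data set.

```python
def sliding_window_splits(n_samples, train_size, test_size, step):
    """Yield (train_indices, test_indices) pairs that move forward through time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = list(range(start, start + train_size))
        test_idx = list(range(start + train_size, start + train_size + test_size))
        yield train_idx, test_idx
        start += step

# Example: 10 observations, train on 4, test on the next 2, slide forward by 2
for train_idx, test_idx in sliding_window_splits(10, train_size=4, test_size=2, step=2):
    print(train_idx, test_idx)
```

For each pair, the idea would be to fit a fresh model on the training rows, predict the test rows, and average the error across windows. The test window always comes strictly after the training window, so the model never sees the future.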

Another limitation is the lack of ability to incorporate additional information into the model. One can imagine variables that could be used along with the time series to further improve the forecast (for example, a variable indicating if Peyton Manning had just won a game, or had a particularly good performance, or appeared in some news articles). We can't do anything like this with Prophet directly. However, one idea I've experimented with in the past that may get around this limitation is building a two-stage model. The first stage is the Prophet model, and we use that to generate predictions. The second stage is a normal regression model that includes the additional signals as independent variables. The wrinkle is that instead of predicting the target directly, we predict the error from the time series model. When you put the two together, this may result in an even better overall forecast.
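Here is a toy sketch of the two-stage idea on synthetic data. The first stage is a stand-in for the time series model's predictions (a known weekly sine wave rather than an actual Prophet fit), and the extra signal is a made-up binary flag. All of the numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Made-up data: a seasonal signal plus an extra effect driven by an external variable
t = np.arange(n, dtype=float)
extra_signal = rng.integers(0, 2, size=n).astype(float)  # e.g. a hypothetical "won a game" flag
y = np.sin(2 * np.pi * t / 7) + 0.8 * extra_signal + rng.normal(0, 0.1, size=n)

# Stage 1: stand-in for the time series model's predictions (captures seasonality only)
stage1_pred = np.sin(2 * np.pi * t / 7)

# Stage 2: regress the stage-1 error on the extra signal
residual = y - stage1_pred
X = np.column_stack([np.ones(n), extra_signal])
coef, _, _, _ = np.linalg.lstsq(X, residual, rcond=None)
stage2_pred = X @ coef

# Final forecast is the sum of both stages
combined_pred = stage1_pred + stage2_pred

mae_stage1 = np.mean(np.abs(y - stage1_pred))
mae_combined = np.mean(np.abs(y - combined_pred))
print(mae_stage1, mae_combined)
```

This is evaluated in-sample to keep the example short. A real version would fit the second stage on a training period and check the combined error on held-out dates.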

All things considered, Prophet is a great addition to the toolbox for time series problems. There are a number of knobs and dials that one can tweak that I didn't get into because I still haven't tried them out, but they provide options for advanced users to improve their forecasts even further. It's worth cautioning that this software is fairly immature so proceed carefully if using it for any serious tasks. That said, the authors claim Facebook uses it extensively so take that for what it's worth.

Zero To One

John Wittenauer — Mon, 26 Jun 2017 00:15:25 GMT

This post is about Peter Thiel's book Zero to One. I'm generally fascinated by how very smart people view the world, particularly if those views are unpopular. Peter Thiel, a well-known entrepreneur/investor and famously contrarian thinker, definitely fits into this category. In this book, Peter lays out his perspective on building the future. The central thesis is the idea that progress doesn't happen on its own - someone has to make it happen. This is where the "zero to one" phrase from the title comes in. It refers to the fact that most things are simply copying or iterating on something that's already been done (1 to n). It takes genius and courage to invent transformative new technology that creates abundance and prosperity (0 to 1).

In Peter's view this technology is most likely to come from startups (as he put it, a startup is the largest group of people you can convince of a plan to build a different future). With that perspective in mind, much of the book focuses on lessons about how to "build the future" through a startup. But I think the main points of the book are valuable whether you have any interest in startups or not. They offer new ways of thinking about technology, markets, and competition. It's bold, contrarian, and thought-provoking. Here are some of the key insights.

Competition & Monopoly

One of the most notable threads that gets touched on often is the nature of competition. Most economists view perfectly competitive markets as capitalism working as intended. Competition causes each participant to raise their game, lower prices, deliver greater value etc. in the pursuit of customers and profit. But Peter argues that competition and capitalism are actually opposites. Competitive markets erode profits until no one is making any money. When margins are very thin and profit is hard to come by, companies can't afford to do anything except fight for market share. They're trapped by short-term thinking.

The opposite state is when a company has a monopoly - complete dominance of a market protected by a huge moat. Most of us would say that monopolies are bad. Monopolies allow companies to charge high prices, slack off on customer service, cut R&D investment, and lots of other bad things. But Thiel argues this is only true in a world where nothing changes. In dynamic, technology-driven markets, creative monopolies can actually be good for society because they have both the incentive and the available cash flow to invest enormous resources into inventing new technologies. Rather than rent-seeking, this class of monopoly creates new categories of abundance.

Thiel's perspective on competition and monopoly is a core component of his advice for building a company. If competition is problematic then the best thing to do is start by owning a small market as a monopoly and expand to adjacent markets from there. This sounds easy but is very hard in practice, which is why few companies achieve it. Companies that do gain a monopoly are able to build some durable, lasting advantage in their market. This usually comes in the form of proprietary technology, network effects, economies of scale, and branding. Some combination of these advantages can generate significant long-term value.

My own perspective is that having a monopoly isn't binary. Nor are markets entirely static or dynamic. These things exist on a continuous spectrum. All companies are probably somewhere between perfect monopoly and perfect competition in any given market, no matter how that market is defined. And all monopolies probably cause some harm to consumers, even if they also lead to some new inventions. It's useful to speak about these things with such rigidity in the abstract to make a point, but the real world is messy. That doesn't mean he's wrong, just that there's probably more to the story in most cases.

Easy, Hard, or Impossible

Another theme from the book that I found really interesting was the trichotomy between easy, hard, and impossible. Thiel argues that most things are either easy or impossible to accomplish. Easy things have already been done, and impossible things can never be done no matter how hard we try. However, there are some things that are hard but possible. In some cases these hard things were previously impossible but have become possible thanks to new technology. Hard things that are possible to achieve are "secrets". They're truths about the world that most people do not know. This is because common knowledge about what is possible changes much more slowly than what is possible in reality. Secrets can come in many forms. They can be about the natural world or they can be about people. These hard things, these "secrets", are what great companies are built on. Companies doing hard things are a shared conspiracy to change the world, founded on a secret known only to those on the inside.

This is a pretty radical view of what it means to be part of a company, but I think it makes sense. Most of us aren't part of a conspiracy to change the world because we're not working for companies doing hard things. We're all probably aware of companies that seem to fit this description though. Tesla would be a good example. Tesla is trying to bring about the end of the fossil fuel era by making electric cars mainstream - hard, but possible. Tesla's employees generally appear to be fanatical about the company's mission. Many on the outside probably don't think Tesla will succeed in its goal, but I bet most of its employees do. Their "secret", though widely known, still may not be widely believed.

Most companies that try to do something hard end up failing. They can fail for a variety of reasons, many of which Peter talks about in the book (culture, team, incentives alignment, distribution, and so on). The distribution of companies that try to do hard things and succeed vs. those that fail probably looks like a power law. This explains why most secrets never see the light of day (and why the secret about hard things remains relatively unknown).

Sales & Persuasion

The last theme in the book that I found interesting has to do with salesmanship and persuasion. As an engineer I'm naturally distrustful of anything that involves "selling". It just feels messy, ambiguous, and somehow dishonest. I think Peter actually changed my mind though. He addressed this very point in the book about engineers underrating sales. It makes sense to talk about sales from the standpoint of building a company. Every successful product or service needs a distribution channel. What intrigued me was thinking about sales in a much more general capacity. Sales is just the art of persuasion. In some sense, we're all selling all the time because persuading people is a normal part of human life. When I really think about it, almost every conversation I have at work or at home involves some amount of persuasion, no matter how mundane. In hindsight, it seems obvious that this is a skill that one can improve on just like anything else.

One other point that Thiel makes is that great salesmen are hidden from sight because it's not obvious that they're selling something. He points out that Elon Musk, widely thought of as the consummate engineer, is also a grandmaster salesman. If you think about it, this actually makes a lot of sense. Elon is able to get some of the smartest, most driven people in the world to work at his companies. He got a massive loan from the government to build electric cars at a time when the world economy was collapsing. Most of the developed world thinks he's the real-life version of Iron Man. These are the marks of someone who knows how to persuade.

This was one of the most information-dense books I've ever read. Most books use a lot of words to say very little. "Zero to One" completely inverts this norm. Nassim Taleb recommended that everyone read this book three times. I've gone through it twice (it's a quick read) and I'm still picking up new things from it. I think he may be on to something.

Thinking, Fast And Slow

John Wittenauer — Sat, 06 May 2017 19:09:27 GMT

This post is about Daniel Kahneman's book Thinking, Fast and Slow. This is a book that everyone should probably read at some point in their lives, or at least read some cliff notes on to get an understanding of the basic ideas. The reason is that it gets at some fundamental truths about how humans function that are both non-obvious and extremely hard to recognize, even after learning about them. These truths apply to virtually any context one can imagine and are not limited in their relevance to any particular discipline or field of study.

The central thesis is a dichotomy between two modes of thought - "System 1" and "System 2". "System 1" is fast, instinctive, and effortless. It happens automatically without us even realizing it. "System 2" is slow, deliberate, and logical. It involves effort and focus. In reality there is no physical distinction between the two modes of thought - it's all happening inside our brains at the same time - but it's a useful abstraction because it captures all the ways that "System 1" thinking can mislead us. The way I think about it, "System 1" is a filter on the raw sensory input we're taking in every second. If we think about it in stream processing terms, "System 1" is reading from the raw input stream and applying successively higher-order functions to the stream so we can quickly make sense of the input. It's what allows us to recognize faces, to read a book, to drive a car, or to catch a ball. These things don't require pupil-dilating mental effort on our part - they just happen. We might be aware that they're happening, but it's a passive awareness. We don't have to expend mental energy to make it so.
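
To make the stream processing analogy a bit more concrete, here is a minimal sketch of a raw input stream passing through a chain of increasingly high-level filters. Everything in it is a made-up illustration of the analogy (the stage names, the "familiar" word set, the sample input); it is not a model of how the brain works and it does not come from Kahneman's book.

```python
from functools import reduce


def tokenize(chars):
    """Low-level filter: group raw characters into words."""
    word = []
    for ch in chars:
        if ch.isspace():
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(ch)
    if word:
        yield "".join(word)


def recognize(words):
    """Higher-level filter: tag each word as a familiar pattern or not."""
    familiar = {"ball", "face", "car"}  # hypothetical "learned" patterns
    for w in words:
        yield (w, w in familiar)


def pipeline(stream, *stages):
    """Feed the raw stream through each stage in order."""
    return reduce(lambda s, stage: stage(s), stages, stream)


# The raw "sensory input" is just a stream of characters.
result = list(pipeline(iter("catch the ball"), tokenize, recognize))
```

Each stage consumes the previous stage's output lazily, so nothing downstream has to "think" about the raw characters, which is roughly the point of the analogy.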

"System 1" thinking is absolutely essential for us to be able to function on a basic level. But it can also lead us astray. The filters being applied automatically to our input stream lead to thoughts and actions that frequently deviate from the rational agent model of human behavior. They lead to all sorts of cognitive biases such as anchoring, availability, substitution, framing, and overconfidence. These effects can be overcome with deliberate thought and effort - in other words, engaging "System 2" thinking to come up with an objective, logical conclusion. The problem is, these cognitive biases kick in so often and with such ease that it's hard to even recognize when such an error in judgement has occurred.

It's jarring to see just how often we're unconsciously influenced by cognitive biases, and how dramatic an effect they can have. In one example, Kahneman illustrates our tendency to accept a default option if one is available. He cites organ donor statistics in two different countries. In one country, the rate of people who opted to be an organ donor was something like 90%. In another, similar country the rate was 4%. So what accounted for the difference? In the first country you had to opt OUT of being a donor, and in the second country you had to opt IN. That was it. This is a startling conclusion. It's possible that there were confounding factors that could account for some of this variation, but it's just one example of an effect that's been observed in many different contexts with a high degree of statistical significance.

The reason I think it would be valuable for everyone to read this book is because understanding these biases likely leads to clearer thinking overall. It may be impossible, or even undesirable, to recognize every time our "System 1" thinking is making judgments that objectively do not make logical sense. But with some effort it's certainly possible to weed out the worst offenses and recognize when we're making egregious errors. In my own mental models, I've incorporated what I learned from Kahneman into my generally Bayesian approach to rational thought. I think this is a fairly sensible way to approach solving problems and making decisions.

How To Learn Hadoop For Free

John Wittenauer — Sun, 02 Apr 2017 17:31:55 GMT

The "big data" technology landscape is changing really, really fast. One consequence of this is that it's hard to find good training resources since they become outdated so quickly. I wanted to get some baseline comfort with a variety of technologies in the Hadoop ecosystem but found my options for thorough, guided education somewhat lacking. I eventually settled on MapR's free training courses. Each one is like a miniature version of an online course (most require only a few hours of time). They include interactive video content, quizzes, and various labs to complete using the MapR sandbox. There's a fairly wide range of courses and the content is very professional.

Below is a brief synopsis of the courses they offer. They are completely free to try out - just follow the link above, create an account, and register for the course you're interested in. In addition, I put all of the content for the courses I worked through (including labs with example code) in a GitHub repo.

Note that due to the fast-paced rate of change that I alluded to earlier (and MapR's vested interest in staying current) the course catalog will likely evolve over time. It's possible that this post will become outdated fairly quickly, although I'll try to revisit it periodically to make sure the guidance is still relevant. I should also note that there are snippets of content throughout the training that are specific to the MapR platform, but more than 90% of it is platform-agnostic.

This list is not exhaustive. It only includes the courses that I spent time working on. Feel free to visit the landing page for a complete list of courses.

Hadoop Essentials

These are short, introductory courses that present a very high-level overview of the Hadoop ecosystem.

ESS 100 - Introduction to Big Data

ESS 101 - Apache Hadoop Essentials

ESS 102 - MapR Converged Data Platform Essentials

MapReduce

MapReduce is how it all got started, and is still used quite a bit. MapReduce is a programming model for distributing work over very large data sets across a cluster of machines. The name comes from the two principal steps involved in the process - map (filtering, sorting etc.) and reduce (summary operations).
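
The map and reduce steps described above can be sketched in a few lines of plain Python. This is a single-machine toy, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are made up for illustration, and the grouping step in the middle stands in for work that a real framework does across the cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values for one key, here by summing counts."""
    return key, sum(values)


def word_count(documents):
    """Run the map, shuffle, and reduce steps over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data moves fast"])
```

In a real cluster each call to the map step could run on a different machine, since it only looks at one record at a time; that independence is what makes the model easy to distribute.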

DEV301 - Developing Hadoop Applications

HBase

HBase is an open-source, non-relational, distributed column-store database written in Java. HBase is very widely used as an alternative to relational databases for certain types of applications where scale is an issue.

DEV320 - HBase Data Model and Architecture

DEV325 - HBase Schema Design

DEV330 - Developing HBase Applications: Basics

DEV335 - Developing HBase Applications: Advanced

DEV340 - HBase Bulk Loading, Performance, and Security

Spark

Spark is an open-source cluster-computing framework that can run on Hadoop. I've written about Spark in the past. Suffice it to say that it is a very exciting (and very popular) framework.

DEV360 - Spark Essentials

DEV361 - Build and Monitor Spark Applications

DEV362 - Create Data Pipeline Using Spark

Drill

Drill is an open-source framework for querying semi-structured and unstructured data at scale using SQL-like syntax. I haven't seen a lot of interest in this outside of the MapR distribution but it's a mature technology that has a lot of potential.

DA410 - Drill Essentials

DA415 - Drill Architecture

Hive

Hive is a data warehousing infrastructure built on top of Hadoop that provides the capability to query data in the Hadoop file system using SQL-like syntax. There's some conceptual overlap between Hive, HBase and Drill that requires some background and context to understand. The relevant courses do a good job of clarifying these relationships.

DA440 - Hive Essentials

Pig

Pig is a high-level programming language and framework for doing ETL (extract, transform, and load) tasks with data. I'm not sure how much Pig is used anymore with newer technologies like Spark offering similar capabilities, but I think there is still a use case for it.

DA450 - Pig Essentials

20 Podcasts Worth Listening To

John Wittenauer — Sat, 04 Mar 2017 18:34:16 GMT

I have a confession to make. I love podcasts. I have a semi-unhealthy addiction to podcasts. I get most of my Twitter follows and book recommendations from podcasts. They've become an essential part of my daily information diet. The medium is just so good. There's no better way to absorb raw, unfiltered information from interesting people with unique perspectives. And despite the medium being relatively young, the selection of great podcasts to listen to has just exploded. There's way too much good content out there to keep up with!

Below is my curated list of podcasts to try out. This list is obviously biased toward my own fields of interest. If you happen to be an art enthusiast or love pop culture, this list will be next to useless for you. But I'm guessing if you're reading this blog then we have at least SOME shared interests. Even if it's not your thing, give one or two a try - you might be surprised.

Without further ado, here are 20 awesome podcasts that you should definitely check out.

Andreessen Horowitz

http://a16z.com/podcasts/

Perfect for technology nuts and wannabe-entrepreneurs. The a16z staff hosts prominent authors, investors, CEOs etc. to talk emerging technology and how the business of technology is being disrupted. For instance, a few recent episodes covered VR for storytelling, genomics, and the next evolution of cities.

Conversations With Tyler

https://medium.com/conversations-with-tyler/all

Tyler Cowen hosts various guests to discuss a wide range of topics, usually with some angle relating to macro-economics. It's actually pretty hard to be more specific than that, they're kind of all over the place.

Exponent

http://exponent.fm/

Ben Thompson (the "Stratechery" guy) and his co-host James Allworth discuss the business and strategy of technology, with the occasional diversion into bigger-picture topics in society and politics. Ben is really, really good at understanding how tech companies operate and where they're headed. It's kind of like getting a business school degree, but better (and for free).

Hidden Brain

http://www.npr.org/podcasts/510308/hidden-brain

Consider it a weekly lesson in psychology, sociology and human behavior, using real-world stories. I mean, everyone likes stories right?

Invest Like The Best

http://investorfieldguide.com/

Ostensibly it's about investing, but a lot of the guests aren't investors at all so it's hard to pigeonhole. Consider it more an exercise in learning how to think and cultivate curiosity, while also learning about things like hedge funds, venture capital, value investing etc.

O'Reilly Data Show

https://www.oreilly.com/topics/oreilly-data-show-podcast

For big data nerds. Actually most episodes lately are about AI, because fucking everyone in tech now has to talk about AI at every opportunity. But it's also about big data. Fairly in-the-weeds discussion, and a lot of emphasis on open-source projects. It's a great way to stay up-to-date on the big data landscape.

Partially Derivative

http://partiallyderivative.com/

For data science nerds. The crew talks data and machine learning while drinking obscure artisanal beer. Hilarity (and learning) ensues. They've also had a lot of good interviews with various data scientists lately (although they're slacking a bit on the beer).

Radiolab

http://www.radiolab.org/series/podcasts/

Radiolab is just awesome. Not sure how else to put it. They do episodes on all sorts of topics (the latest one on CRISPR was really good). Production quality is super high. The topics are also really accessible. If you're new to podcasting this is a great place to start.

Rationally Speaking

http://rationallyspeakingpodcast.org/archive/

Their tagline is "exploring the borderlands between reason and nonsense". If you're skeptical of that claim, then you should probably be listening to this podcast!

Recode Decode

http://www.recode.net/recode-decode-podcast-kara-swisher

Kara Swisher grills various Silicon Valley elites about various tech topics. Okay, it's not only that. But it's MOSTLY that.

Revisionist History

http://revisionisthistory.com/

Malcolm Gladwell did a ten-part series where he goes in-depth on random stories from the past and shows how the popular narrative around those events was either wrong or at least incomplete. The show has been on hiatus for a while but it's worth spinning through the archive.

Waking Up With Sam Harris

https://www.samharris.org/podcast/full_archive

Sam is a really smart dude. He invites lots of other really smart dudes on his show, and they spend a few hours going ludicrously in-depth on various philosophical, scientific, and political subjects. It takes a special type of person to be into this sort of thing, but if you're that type of person, you won't find anything else like this anywhere else.

StarTalk Radio

https://www.startalkradio.net/show/

Neil deGrasse Tyson, also known as the millennials' Carl Sagan, is at his best in these entertaining, free-flowing discussions about various science or science-adjacent topics. All sorts of interesting guests and very wide-ranging interviews. Also, one of the comedians that frequents the show kind of sounds like the guy from Archer, which is pretty cool.

Talking Machines

http://www.thetalkingmachines.com/blog/

For hardcore machine learning nerds. They do lots of interviews with researchers in various specializations. It's pretty technical but very useful if you're an ML practitioner. No new episodes in the last few months but I'm keeping an eye on it.

TED Radio Hour

http://www.npr.org/programs/ted-radio-hour/

This is one of my favorites. Very high production quality, super-wide range of interesting topics. Each hour-long show stitches together excerpts from several TED talks that share some common theme. They also add some narration and frequently interview the TED speakers to add some color to the original talks. Awesome series.

The Ezra Klein Show

http://www.vox.com/ezra-klein-show-podcast

Ezra is a political journalist but the podcast isn't focused on politics (although some of his guests are politicians). Just a lot of really good interviews with really smart people. Good podcast hosts are able to frame questions in ways that guide the discussion in interesting directions, and Ezra is especially good at this.

The Knowledge Project

https://www.farnamstreetblog.com/the-knowledge-project/

Just recently discovered this one. The stated goal is to focus on "actionable strategies that help you make better decisions, avoid stupidity, and live a better life". A lot of the interviews are geared towards reading and knowledge acquisition.

The Tim Ferriss Show

http://tim.blog/podcast/

Pretty much everyone knows who Tim is. In addition to his famous "4-hour" books, he's one of the guys that really launched podcasting as a medium into the mainstream. His tagline is "deconstructing world-class performers". A lot of the interviews are really good, although some could be edited down a bit.

Value Investing Podcast

http://valuepodcast.com/

I'm not gonna lie, these conversations are dry as hell. But if you're serious about investing then there's a LOT of good information to learn here.

Vox's The Weeds

http://www.vox.com/the-weeds

For policy nerds (yes, that's a thing). Actually politics more broadly, but they spend a lot of time focused specifically on policy details (health care, taxes, education, and so on). It isn't called "The Weeds" for nothing. For some reason I find it surprisingly fascinating. Useful if you want to talk circles around baffled family members at the next politically-charged Thanksgiving dinner.

Happy podcasting!