# Deep Learning With Keras: Recurrent Networks

25th August 2019

This post is the fourth in a series on deep learning using Keras. We've already looked at dense networks with category embeddings, convolutional networks, and recommender systems. For this installment we're going to use recurrent networks to create a character-level language model for text generation. We'll start with a simple fully-connected network and show how it can be used as an "unrolled" recurrent layer, then gradually build up from there until we have a model capable of generating semi-reasonable-sounding text. Much of this content is based on Jeremy Howard's fast.ai lessons. However, we'll use Keras instead of PyTorch and build out all of the code from scratch rather than relying on the fast.ai library.

The text corpus we're using for this task is the collected works of the philosopher Nietzsche. The whole corpus can be found here. Let's start by loading the data into memory and taking a peek at the beginning of the text.

```
%matplotlib inline
import io
import numpy as np
import keras
from keras.utils.data_utils import get_file

# Download the corpus (cached after the first call) and lowercase it.
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
len(text)
```

600893

```
text[:400]
```

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all philosophers, in so far as they have been\ndogmatists, have failed to understand women--that the terrible\nseriousness and clumsy importunity with which they have usually paid\ntheir addresses to truth, have been unskilled and unseemly methods for\nwinning a woman? certainly she has never allowed herself '

Now get the unique set of characters that appear in the text. This is our vocabulary.

```
chars = sorted(list(set(text)))
vocab_size = len(chars)
vocab_size
```

57

```
''.join(chars)
```

'\n !"\'(),-.0123456789:;=?[]_abcdefghijklmnopqrstuvwxyzäæéë'

Let's create a dictionary that maps each unique character to an integer, which is what we'll feed into the model. The actual integer used isn't important, it just has to be unique (here we just take the index from the "chars" list above). It's also useful to have a reverse mapping to get back to characters in order to do something with the model output. Finally, create a "mapped" corpus where each character in the data has been replaced with its corresponding integer.

```
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}
idx = [char_indices[c] for c in text]
idx[:20]
```

[42, 44, 31, 32, 27, 29, 31, 0, 0, 0, 45, 47, 42, 42, 41, 45, 35, 40, 33, 1]

We can convert from integers back to characters using something like this.

```
''.join(indices_char[i] for i in idx[:100])
```

'preface\n\n\nsupposing that truth is a woman--what then? is there not ground\nfor suspecting that all ph'

For our first attempt, we'll build a model that accepts a 3-character sequence as input and tries to predict the following character in the text. For simplicity, we can just manually create each character sequence. Start by creating lists that take every 3rd character, offset by some amount between 0 and 3.

```
cs = 3
c1 = [idx[i] for i in range(0, len(idx) - cs, cs)]
c2 = [idx[i + 1] for i in range(0, len(idx) - cs, cs)]
c3 = [idx[i + 2] for i in range(0, len(idx) - cs, cs)]
c4 = [idx[i + 3] for i in range(0, len(idx) - cs, cs)]
```

The next step just converts the lists to numpy arrays. Notice that this approach results in non-overlapping sequences, i.e. we use characters 0-2 to predict character 3, then characters 3-5 to predict character 6, etc. That's why each array is about 1/3 the length of the original text. We'll see how to improve on this later.

```
x1 = np.stack(c1)
x2 = np.stack(c2)
x3 = np.stack(c3)
y = np.stack(c4)
x1.shape, y.shape
```

((200297,), (200297,))
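To make the non-overlapping windows concrete, here is a small sketch on a ten-character toy string. The names `toy` and `step` are stand-ins for illustration and are not part of the code above.

```python
# Toy illustration of the non-overlapping 3-character windows.
# `toy` and `step` are hypothetical stand-ins for `text` and `cs`.
toy = "abcdefghij"
step = 3

# Same indexing pattern as c1..c4: start a new window every `step` characters.
windows = [(toy[i:i + step], toy[i + step]) for i in range(0, len(toy) - step, step)]

for context, target in windows:
    print(repr(context), "->", repr(target))
```

Ten characters yield only three training examples, which mirrors the roughly one-third ratio seen in the array shapes.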

Our model will use embeddings to represent each character. This is why we converted them to integers before - each integer gets turned into a vector in the embedding layer. Set some variables for the embedding vector size and the number of hidden units to use in the model. Finally, we need to convert the target variable to a one-hot character encoding. This is because the model outputs a probability for each character, and in order to score this properly it needs to be able to compare that output with an array that's structured the same way.

```
n_factors = 42
n_hidden = 256
y_cat = keras.utils.to_categorical(y)
y_cat.shape
```

(200297, 56)
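Notice that the encoded target has 56 columns even though the vocabulary has 57 characters. When no class count is passed, `to_categorical` sizes the output from the largest label it sees, so this would be consistent with the highest-indexed character never landing in a target position for this particular slicing. Below is a minimal numpy sketch of the same idea; `one_hot` and `toy_labels` are illustrative names, not Keras functions.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Minimal one-hot encoder (a sketch of the idea, not the Keras source)."""
    labels = np.asarray(labels)
    if num_classes is None:
        # Infer the width from the largest label that actually occurs.
        num_classes = labels.max() + 1
    out = np.zeros((labels.shape[0], num_classes), dtype="float32")
    out[np.arange(labels.shape[0]), labels] = 1.0
    return out

toy_labels = [0, 2, 1, 2]           # label 3 never appears as a target
inferred = one_hot(toy_labels)      # width inferred from the data
explicit = one_hot(toy_labels, 4)   # width fixed to the full vocabulary
print(inferred.shape, explicit.shape)
```

Whichever width is used, the model's output layer has to match it, which is why the first model below uses `vocab_size - 1` output units.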

Now we get to the first iteration of our model. The way I've structured this is by defining the layers of the model so that they can be re-used across multiple inputs. For example, rather than create an embedding layer for each of the three character inputs, we're instead creating one embedding layer and sharing it. This is a reasonable approach to handling sequences since each input comes from an identical distribution.

The next thing to observe is the part where h is defined. The first character is fed through the hidden layer like normal, but the other characters in the sequence are doing something different. We're re-using the same layer, but instead of just taking the character as input, we're using the character + the previous output h. This is the "hidden" state of the model. I think about it in the following way: "give me the output of this layer for character c conditioned on the fact that these other characters (represented by h) came before it".

You'll notice that there's no use of an RNN class at all. Basically what's going on here is we're implementing an "unrolled" RNN from scratch on our own.

```
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Reshape, Dense, add
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam

def Char3Model(vocab_size, n_factors, n_hidden):
    # Layers are created once and shared across all three character inputs.
    embed_layer = Embedding(vocab_size, n_factors)
    reshape_layer = Reshape((n_factors,))
    input_layer = Dense(n_hidden, activation='relu')
    hidden_layer = Dense(n_hidden, activation='tanh')
    # y_cat has vocab_size - 1 columns here, so the output width must match.
    output_layer = Dense(vocab_size - 1, activation='softmax')

    in1 = Input(shape=(1,))
    in2 = Input(shape=(1,))
    in3 = Input(shape=(1,))

    c1 = input_layer(reshape_layer(embed_layer(in1)))
    c2 = input_layer(reshape_layer(embed_layer(in2)))
    c3 = input_layer(reshape_layer(embed_layer(in3)))

    # The hidden state h is updated once per character.
    h = hidden_layer(c1)
    h = hidden_layer(add([h, c2]))
    h = hidden_layer(add([h, c3]))
    out = output_layer(h)

    model = Model(inputs=[in1, in2, in3], outputs=out)
    opt = Adam(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=opt)
    return model
```
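If the layer sharing is hard to picture, the same forward pass can be sketched in plain numpy. This is only an illustration under simplifying assumptions: the weights are random rather than trained, biases are left out, the output width is the full vocabulary, and names such as `E`, `W_in`, `W_h`, `W_out`, and `forward` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_fac, n_hid = 57, 42, 256   # sizes borrowed from the post

# One set of weights, shared by all three character positions.
E = rng.normal(scale=0.1, size=(vocab, n_fac))      # embedding table
W_in = rng.normal(scale=0.1, size=(n_fac, n_hid))   # plays the role of input_layer
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid))    # plays the role of hidden_layer
W_out = rng.normal(scale=0.1, size=(n_hid, vocab))  # plays the role of output_layer

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(char_ids):
    h = np.zeros(n_hid)
    for t, idx in enumerate(char_ids):
        c = relu(E[idx] @ W_in)   # embed the character, then the input layer
        # The first step sees only the character; later steps add the previous h.
        h = np.tanh((c if t == 0 else h + c) @ W_h)
    return softmax(h @ W_out)

probs = forward([42, 44, 31])   # the indices for "pre" from earlier
print(probs.shape, round(float(probs.sum()), 6))
```

The key point is that `W_h` is applied at every step, so the same weights process each position while `h` carries information forward.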

Train the model for a few iterations.

```
model = Char3Model(vocab_size, n_factors, n_hidden)
history = model.fit(x=[x1, x2, x3], y=y_cat, batch_size=512, epochs=3, verbose=1)
```

Epoch 1/3 200297/200297 [==============================] - 4s 20us/step - loss: 2.4007
Epoch 2/3 200297/200297 [==============================] - 2s 10us/step - loss: 2.0852
Epoch 3/3 200297/200297 [==============================] - 2s 10us/step - loss: 1.9470

In order to make sense of the model's output, we need a helper function that converts the character probability array that it returns into an actual character. This is where the reverse lookup table we created earlier comes in handy!

```
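def get_next_char(model, s):
    # One single-element array per input, matching the model's three inputs.
    idxs = [np.array([char_indices[c]]) for c in s]
    pred = model.predict(idxs)
    # Pick the most probable character and map it back through the vocabulary.
    char_idx = np.argmax(pred)
    return chars[char_idx]

get_next_char(model, ' th')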
```

'e'

```
get_next_char(model, 'and')
```

' '

It appears to be spitting out sensible results. The 3-character approach is very limiting though. That's not enough context for even a full word most of the time. For our next step, let's expand the input window to 8 characters. We can create an input array using some list comprehension magic to output a list of lists, then stacking them together into an array. Try experimenting with the logic below yourself to get a better sense of what it's doing. The target array is created in a similar manner as before.

```
cs = 8
c_in = [[idx[i + j] for i in range(cs)] for j in range(len(idx) - cs)]
c_out = [idx[j + cs] for j in range(len(idx) - cs)]
X = np.stack(c_in, axis=0)
y = np.stack(c_out)
```

Notice this time we're making better use of our data by making the sequences overlapping. For example, the first "row" in the data uses characters 0-7 to predict character 8. The next "row" uses characters 1-8 to predict character 9, and so on. We just increment by one each time. It does create a lot of duplicate data, but that's not a huge issue with a corpus of this size.

```
X.shape, y.shape
```

((600885, 8), (600885,))

It helps to look at an example to see how the data is formatted. Each row is a sequence of 8 characters from the text. As you go down the rows it's apparent they're offset by one character.

```
X[:cs, :cs]
```

array([[42, 44, 31, 32, 27, 29, 31, 0],
       [44, 31, 32, 27, 29, 31, 0, 0],
       [31, 32, 27, 29, 31, 0, 0, 0],
       [32, 27, 29, 31, 0, 0, 0, 45],
       [27, 29, 31, 0, 0, 0, 45, 47],
       [29, 31, 0, 0, 0, 45, 47, 42],
       [31, 0, 0, 0, 45, 47, 42, 42],
       [ 0, 0, 0, 45, 47, 42, 42, 41]])

```
y[:cs]
```

array([ 0, 0, 45, 47, 42, 42, 41, 45])
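For anyone experimenting with the windowing logic, the same overlapping layout can also be built with numpy broadcasting instead of nested list comprehensions. This is an alternative sketch, not what the code above does; `toy_idx`, `win`, `X_toy`, and `y_toy` are stand-in names.

```python
import numpy as np

toy_idx = np.arange(10, 22)   # stand-in for the encoded corpus `idx`
win = 4                       # stand-in for `cs`

n_rows = len(toy_idx) - win
# Row j holds positions j, j+1, ..., j+win-1 (broadcast add of two ranges).
positions = np.arange(n_rows)[:, None] + np.arange(win)[None, :]
X_toy = toy_idx[positions]
y_toy = toy_idx[win:]         # the character right after each window

print(X_toy[:3])
print(y_toy[:3])
```

Each row starts one position later than the row above it, and the target is always the value immediately following the window.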

Since we have separate inputs for each character, Keras expects separate arrays rather than one big array. We also need to one-hot encode the target again.

```
X_array = [X[:, i] for i in range(X.shape[1])]
y_cat = keras.utils.to_categorical(y)
```

The 8-character model works exactly the same way as the 3-character model; there are just more of the same steps. Rather than write them all out in code, I converted it to a loop. Again, this is almost exactly the way an RNN works under the hood.

```
def CharLoopModel(vocab_size, n_chars, n_factors, n_hidden):
    # Shared layers, exactly as in the 3-character model.
    embed_layer = Embedding(vocab_size, n_factors)
    reshape_layer = Reshape((n_factors,))
    input_layer = Dense(n_hidden, activation='relu')
    hidden_layer = Dense(n_hidden, activation='tanh')
    output_layer = Dense(vocab_size, activation='softmax')

    inputs = []
    for i in range(n_chars):
        inp = Input(shape=(1,))
        inputs.append(inp)
        c = input_layer(reshape_layer(embed_layer(inp)))
        if i == 0:
            # First character: no previous hidden state to combine with.
            h = hidden_layer(c)
        else:
            h = hidden_layer(add([h, c]))
    out = output_layer(h)

    model = Model(inputs=inputs, outputs=out)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)
    return model
```

Train the model a bit and generate some predictions.

```
model = CharLoopModel(vocab_size, cs, n_factors, n_hidden)
history = model.fit(x=X_array, y=y_cat, batch_size=512, epochs=5, verbose=1)
```

Epoch 1/5 600885/600885 [==============================] - 9s 15us/step - loss: 2.1838
Epoch 2/5 600885/600885 [==============================] - 8s 13us/step - loss: 1.7245
Epoch 3/5 600885/600885 [==============================] - 8s 13us/step - loss: 1.5888
Epoch 4/5 600885/600885 [==============================] - 8s 13us/step - loss: 1.5200
Epoch 5/5 600885/600885 [==============================] - 8s 13us/step - loss: 1.4781

```
get_next_char(model, 'for thos')
```

'e'

```
get_next_char(model, 'queens a')
```

'n'

Now we're ready to replace the loop with a real recurrent layer. The first thing to notice is that we no longer need to create separate inputs for each step in the sequence - recurrent layers in Keras are designed to accept 3-dimensional arrays where the 2nd dimension is the number of timesteps. We just need to add an extra dimension to the input shape with the number of characters.

The second wrinkle is the use of the "TimeDistributed" class on the embedding layer. Just as with the input, this is another more convenient way of doing what we were already doing by defining and re-using layers. Wrapping a layer with "TimeDistributed" basically says "apply this to every timestep in the array". Like the RNN, it expects an array with a timestep dimension. Because each timestep holds a single integer, the embedding output comes back with an extra dimension of size one, and the reshape operation removes it so the RNN receives a 3-dimensional array. The RNN layer itself is very straightforward.

```
from keras.layers import TimeDistributed, SimpleRNN

def CharRnn(vocab_size, n_chars, n_factors, n_hidden):
    i = Input(shape=(n_chars, 1))
    x = TimeDistributed(Embedding(vocab_size, n_factors))(i)
    x = Reshape((n_chars, n_factors))(x)
    x = SimpleRNN(n_hidden, activation='tanh')(x)
    x = Dense(vocab_size, activation='softmax')(x)
    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)
    return model
```

Let's look at a summary of the model. Notice the array shapes have a third dimension to them until we get on the other side of the RNN.

```
model = CharRnn(vocab_size, cs, n_factors, n_hidden)
model.summary()
```

| Layer (type) | Output Shape | Param # |
| --- | --- | --- |
| input_12 (InputLayer) | (None, 8, 1) | 0 |
| time_distributed_1 (TimeDist | (None, 8, 1, 42) | 2394 |
| reshape_3 (Reshape) | (None, 8, 42) | 0 |
| simple_rnn_1 (SimpleRNN) | (None, 256) | 76544 |
| dense_7 (Dense) | (None, 57) | 14649 |

Total params: 93,587. Trainable params: 93,587. Non-trainable params: 0.
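The parameter counts in the summary can be reproduced by hand, which is a nice sanity check on what each layer contains. The arithmetic below assumes the usual layout of a simple recurrent layer: one input weight matrix, one recurrent weight matrix, and one bias vector.

```python
vocab_size, n_factors, n_hidden = 57, 42, 256

embedding = vocab_size * n_factors                   # one vector per character
# SimpleRNN: input weights + recurrent weights + bias
simple_rnn = n_factors * n_hidden + n_hidden * n_hidden + n_hidden
dense = n_hidden * vocab_size + vocab_size           # weights + bias

total = embedding + simple_rnn + dense
print(embedding, simple_rnn, dense, total)   # 2394 76544 14649 93587
```

Most of the capacity sits in the 256 x 256 recurrent matrix, which is the part that carries the hidden state from one timestep to the next.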

Reshape the input to match the 3-dimensional input format of (rows, timesteps, features). Since we only have one feature, the last dimension is trivially set to one.

```
X = X.reshape((X.shape[0], cs, 1))
X.shape
```

(600885, 8, 1)

Train the model for a bit. Notice that the loss looks almost identical to the last model! All we really did is shuffle things around to take advantage of some built-in classes that Keras provides. The model structure and performance should look no different than before.

```
history = model.fit(x=X, y=y_cat, batch_size=512, epochs=5, verbose=1)
```

Epoch 1/5 600885/600885 [==============================] - 11s 18us/step - loss: 2.2863
Epoch 2/5 600885/600885 [==============================] - 11s 18us/step - loss: 1.8356
Epoch 3/5 600885/600885 [==============================] - 10s 17us/step - loss: 1.6601
Epoch 4/5 600885/600885 [==============================] - 10s 17us/step - loss: 1.5672
Epoch 5/5 600885/600885 [==============================] - 10s 17us/step - loss: 1.5102

We can train it a bit longer at a lower learning rate to reduce the loss further.

```
K.set_value(model.optimizer.lr, 0.0001)
history = model.fit(x=X, y=y_cat, batch_size=512, epochs=3, verbose=1)
```

Epoch 1/3 600885/600885 [==============================] - 10s 17us/step - loss: 1.4350
Epoch 2/3 600885/600885 [==============================] - 10s 17us/step - loss: 1.4233
Epoch 3/3 600885/600885 [==============================] - 10s 17us/step - loss: 1.4175

```
def get_next_char(model, s):
    idxs = np.array([char_indices[c] for c in s])
    # Shape (1, timesteps, 1): a batch containing a single sequence.
    idxs = idxs.reshape((1, idxs.shape[0], 1))
    pred = model.predict(idxs)
    char_idx = np.argmax(pred)
    return chars[char_idx]

get_next_char(model, 'for thos')
```

'e'

Since the model is getting better, we can now try to generate more than one character of text. All we need is an initial seed of 8 characters and it can go on as long as we like. To do this, we'll create a simple helper function that continuously predicts the next character using the last 8 characters that it spit out (starting with the seed value).

```
def get_next_n_chars(model, s, n):
    r = s
    for i in range(n):
        c = get_next_char(model, s)
        r += c
        # Slide the window: drop the oldest character, append the new one.
        s = s[1:] + c
    return r

get_next_n_chars(model, 'for thos', 40)
```

'for those who has not to be a conscience of the '

It's definitely getting better. There are more improvements we can make though! In the current model, each instance of the data is completely independent. When a new sequence comes in, the model has no idea what came before that sequence. That "hidden state" mentioned earlier (which is now part of the RNN layer) gets thrown away. However, there's a way we can set this up that persists that hidden state through to the next part of the sequence. In other words, it conditions the output not only on the current 8 characters but all the characters that came before it as well.

The good news is that this capability is built into Keras's recurrent layers; we just need to set a flag to true! The bad news is that we need to re-think how the data is structured. Stateful models require 1) a fixed batch size, which is specified in the model input, and 2) that each batch be a "slice" of sequences such that the next batch contains the next part of each sequence. In other words, we need to split up our data (which is one long continuous stream of text) into n chunks of equal-length streams of text, where n is the batch size. Then, we need to carve up these n chunks into sequences of length 8 (which is the sequence length the model looks at) with the following character in each sequence being the target (the thing we're predicting).

If that sounds confusing and complicated, that's because it is. It took me a while to make sense of it (and figure out how to express it in code) but hopefully you can follow along. Below is the first step, which splits the data up into chunks and stacks them vertically into an array. The result is 64 equal-length continuous sequences of text.

```
bs = 64
seg_len = len(text) // bs
segments = [idx[i*seg_len:(i+1)*seg_len] for i in range(bs)]
segments = np.stack(segments)
segments.shape
```

(64, 9388)

One other change happening at the same time is we're no longer staggering the input by one character (which duplicates a lot of text because most of it is repeated in each row). Instead, we're now carving the data into chunks of non-overlapping characters like we did originally. However, we're going to make better use of it this time. Instead of just predicting character 8 based on characters 0-7, we're going to predict characters 1-8 conditioned on the characters in the sequence that came before them. Each pass will actually be 8 character predictions, and the loss function will be calculated across all of those outputs (we'll see how to do this in a minute).

Below we're creating a list of arrays, where each array holds one 8-character sequence from each of the 64 streams. The second list is the same thing offset by one character (this is our target).

```
c_in = [segments[:,i*cs:(i+1)*cs] for i in range(seg_len // cs)]
c_out = [segments[:,(i*cs)+1:((i+1)*cs)+1] for i in range(seg_len // cs)]
```

Now we just need to concatenate and reshape these into arrays that we can use with the model. We end up with ~75,000 chunks of unique 8-character sequences.

```
X = np.concatenate(c_in)
X = X.reshape((X.shape[0], X.shape[1], 1))
y = np.concatenate(c_out)
y_cat = keras.utils.to_categorical(y)
X.shape, y_cat.shape
```

((75072, 8, 1), (75072, 8, 57))

Crucially, they are ordered such that the 65th row is a continuation of the 1st row, the 66th row is a continuation of the 2nd row, and so on all the way down.

```
''.join(indices_char[i] for i in np.concatenate((X[0,:,0], X[64,:,0], X[128,:,0])))
```

'preface\n\n\nsupposing that'
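The ordering is easier to see on a tiny example. The sketch below uses two streams of twenty integers and windows of four; `toy`, `n_streams`, and `win` are stand-ins for the encoded corpus, `bs`, and `cs`. One deliberate difference from the code above: the number of windows is computed as `(seg - 1) // win` so that the shifted target never runs off the end of a stream, which would otherwise happen here because 20 is an exact multiple of 4.

```python
import numpy as np

toy = np.arange(40)    # stand-in for the encoded corpus
n_streams = 2          # plays the role of `bs`
win = 4                # plays the role of `cs`

# Split the data into equal-length continuous streams, one per batch row.
seg = len(toy) // n_streams
streams = np.stack([toy[i * seg:(i + 1) * seg] for i in range(n_streams)])

# Leave one spare element at the end of each stream for the shifted target.
n_chunks = (seg - 1) // win
x_chunks = [streams[:, i * win:(i + 1) * win] for i in range(n_chunks)]
y_chunks = [streams[:, i * win + 1:(i + 1) * win + 1] for i in range(n_chunks)]

X_toy = np.concatenate(x_chunks)   # shape (n_chunks * n_streams, win)
y_toy = np.concatenate(y_chunks)

print(X_toy[0], X_toy[n_streams])  # row 0, then its continuation one batch later
print(y_toy[0])                    # row 0's targets: the same window shifted by one
```

With a batch size of `n_streams` and no shuffling, row 0 of every batch comes from the first stream and row 1 from the second, so the hidden state left over from one batch lines up with the right text in the next.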

Next we can create the stateful RNN model. It's similar to the last one but there are a few wrinkles. The input specifies "batch_shape" and has three dimensions (this is a hard requirement to use stateful RNNs in Keras, and gets quite annoying during inference time). We've set "return_sequences" to true, which changes the shape that the RNN returns and gives us an output for each step in the sequence. We've set "stateful" to true, the motivation for which was already discussed. Finally, we've wrapped the last dense layer with "TimeDistributed". This is because the RNN is now returning a higher-dimensional array to account for the output at each timestep. Everything else works basically the same way.

```
def CharStatefulRnn(vocab_size, n_chars, n_factors, n_hidden, bs):
    i = Input(batch_shape=(bs, n_chars, 1))
    x = TimeDistributed(Embedding(vocab_size, n_factors))(i)
    x = Reshape((n_chars, n_factors))(x)
    x = SimpleRNN(n_hidden, activation='tanh', return_sequences=True, stateful=True)(x)
    x = TimeDistributed(Dense(vocab_size, activation='softmax'))(x)
    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)
    return model
```

Looking at the output shapes, we can see the effect of turning on "return_sequences". Note that the number of model parameters has not changed. The complexity is identical, we've just changed the task and the information available to solve it.

```
model = CharStatefulRnn(vocab_size, cs, n_factors, n_hidden, bs)
model.summary()
```

| Layer (type) | Output Shape | Param # |
| --- | --- | --- |
| input_13 (InputLayer) | (64, 8, 1) | 0 |
| time_distributed_2 (TimeDist | (64, 8, 1, 42) | 2394 |
| reshape_4 (Reshape) | (64, 8, 42) | 0 |
| simple_rnn_2 (SimpleRNN) | (64, 8, 256) | 76544 |
| time_distributed_3 (TimeDist | (64, 8, 57) | 14649 |

Total params: 93,587. Trainable params: 93,587. Non-trainable params: 0.
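Since the model now emits one probability distribution per timestep, the loss has to be computed at every position of every sequence. As I understand it, Keras computes the cross-entropy per timestep and then averages everything into a single number. Here is a numpy sketch of that calculation using random stand-in data; none of these arrays come from the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, vocab = 64, 8, 57   # shapes from the summary above

# Fake softmax outputs and fake integer targets, just to exercise the shapes.
logits = rng.normal(size=(batch, steps, vocab))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
targets = rng.integers(0, vocab, size=(batch, steps))
y_true = np.eye(vocab)[targets]   # one-hot, shape (64, 8, 57)

# Cross-entropy at every (sequence, timestep) position ...
per_step = -(y_true * np.log(probs)).sum(axis=-1)   # shape (64, 8)
# ... then a single scalar by averaging over all of them.
loss = per_step.mean()
print(per_step.shape, loss)
```

This is why the target array needs the same three-dimensional layout as the model output: each of the 8 predictions in a row gets scored against its own one-hot target.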

One quirk of using stateful RNNs is that we now have to manually reset the model state; it never goes away until we tell it to. I just created a simple callback that resets the state at the end of every epoch.

```
from keras.callbacks import Callback

class ResetModelState(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Clear the carried-over hidden state before the next pass over the text.
        self.model.reset_states()

reset_state = ResetModelState()
```
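To see what carrying the state actually buys us, here is a small numpy sketch of a plain tanh recurrence (random weights, no bias, made-up names). Processing a 16-step sequence in one go gives the same final hidden state as processing it in two 8-step chunks, as long as the state from the first chunk is handed to the second. Resetting in between throws that context away.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 3, 5
W = rng.normal(scale=0.5, size=(n_in, n_hid))    # input weights
U = rng.normal(scale=0.5, size=(n_hid, n_hid))   # recurrent weights

def run_rnn(xs, h):
    """Plain tanh recurrence; returns the final hidden state."""
    for x in xs:
        h = np.tanh(x @ W + h @ U)
    return h

seq = rng.normal(size=(16, n_in))   # one long sequence of 16 steps
zeros = np.zeros(n_hid)

full = run_rnn(seq, zeros)                            # all 16 steps at once
carried = run_rnn(seq[8:], run_rnn(seq[:8], zeros))   # two chunks, state kept
reset = run_rnn(seq[8:], zeros)                       # second chunk, state reset

print(np.allclose(full, carried), np.allclose(full, reset))
```

The reset at the end of each epoch is needed for the same reason in reverse: the first batch of the next epoch starts back at the beginning of each stream, so the state left over from the end of the text no longer applies.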

Train the model for a while as before, with the addition of the callback to reset state between epochs.

```
model.fit(x=X, y=y_cat, batch_size=bs, epochs=8, verbose=1, callbacks=[reset_state], shuffle=False)
```

Epoch 1/8 75072/75072 [==============================] - 21s 277us/step - loss: 2.2509
Epoch 2/8 75072/75072 [==============================] - 20s 261us/step - loss: 1.8441
Epoch 3/8 75072/75072 [==============================] - 20s 263us/step - loss: 1.6865
Epoch 4/8 75072/75072 [==============================] - 19s 259us/step - loss: 1.6052
Epoch 5/8 75072/75072 [==============================] - 20s 261us/step - loss: 1.5540
Epoch 6/8 75072/75072 [==============================] - 20s 262us/step - loss: 1.5186
Epoch 7/8 75072/75072 [==============================] - 20s 261us/step - loss: 1.4922
Epoch 8/8 75072/75072 [==============================] - 20s 263us/step - loss: 1.4714

```
K.set_value(model.optimizer.lr, 0.0001)
model.fit(x=X, y=y_cat, batch_size=bs, epochs=3, verbose=1, callbacks=[reset_state], shuffle=False)
```

Epoch 1/3 75072/75072 [==============================] - 19s 259us/step - loss: 1.4280
Epoch 2/3 75072/75072 [==============================] - 19s 259us/step - loss: 1.4191
Epoch 3/3 75072/75072 [==============================] - 20s 264us/step - loss: 1.4154

The "get next" functions need to be updated since our approach has changed. One of the annoying things about stateful models is that the batch size is fixed, so even when making a prediction the model needs an array with a full batch of rows, even if we only want to predict one sequence. I got around this with some numpy hackery.

```
def get_next_char(model, bs, s):
    idxs = np.array([char_indices[c] for c in s])
    idxs = idxs.reshape((1, idxs.shape[0], 1))
    # The stateful model only accepts full batches, so repeat the seed bs times.
    idxs = np.repeat(idxs, bs, axis=0)
    pred = model.predict(idxs, batch_size=bs)
    # Every row is identical; read the last timestep of the first row.
    char_idx = np.argmax(pred[0, -1])
    return chars[char_idx]

def get_next_n_chars(model, bs, s, n):
    r = s
    for i in range(n):
        c = get_next_char(model, bs, s)
        r += c
        s = s[1:] + c
    return r

get_next_n_chars(model, bs, 'for thos', 40)
```

'for those in the same the same the same the same'
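The batch-padding trick is easier to follow with the shapes written out. In the sketch below, the seed indices are made up and `fake_pred` is a random array standing in for what `model.predict` would return, so only the shape handling is meaningful.

```python
import numpy as np

batch = 64                                   # the fixed batch size (`bs`)
seed = np.array([5, 1, 7, 3, 3, 0, 2, 6])    # made-up indices for an 8-char seed

single = seed.reshape((1, seed.shape[0], 1))   # (1, 8, 1): one sequence
padded = np.repeat(single, batch, axis=0)      # (64, 8, 1): 64 identical copies

# Stand-in for model.predict: a fake (64, 8, 57) array of per-step scores.
fake_pred = np.random.default_rng(0).random((batch, seed.shape[0], 57))
next_idx = int(np.argmax(fake_pred[0, -1]))    # row 0, last timestep only

print(padded.shape, next_idx)
```

Because all 64 rows contain the same seed, 63 of the predictions are wasted work, but it keeps the input shape the model was built with.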

The output is actually a bit worse than before, but we're still using simple RNNs which aren't that great to begin with. The real fun comes when we make the jump to a more complex unit like the LSTM. The details of LSTMs are beyond my scope here but there's a great blog post that everyone links to as the canonical explainer for LSTMs, which you can find here. This is the easiest step yet as the only thing we need to do is replace the class name. The only other change I made is increasing the number of hidden units. Everything else stays exactly the same.

```
from keras.layers import LSTM

n_hidden = 512

def CharStatefulLSTM(vocab_size, n_chars, n_factors, n_hidden, bs):
    i = Input(batch_shape=(bs, n_chars, 1))
    x = TimeDistributed(Embedding(vocab_size, n_factors))(i)
    x = Reshape((n_chars, n_factors))(x)
    x = LSTM(n_hidden, return_sequences=True, stateful=True)(x)
    x = TimeDistributed(Dense(vocab_size, activation='softmax'))(x)
    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=opt)
    return model
```

LSTMs need to train for a bit longer. We'll do 20 epochs at each learning rate.

```
model = CharStatefulLSTM(vocab_size, cs, n_factors, n_hidden, bs)
model.fit(x=X, y=y_cat, batch_size=bs, epochs=20, verbose=1, callbacks=[reset_state], shuffle=False)
```

Epoch 1/20 75072/75072 [==============================] - 30s 401us/step - loss: 2.1748
Epoch 2/20 75072/75072 [==============================] - 29s 380us/step - loss: 1.6091
Epoch 3/20 75072/75072 [==============================] - 29s 381us/step - loss: 1.4487
Epoch 4/20 75072/75072 [==============================] - 28s 379us/step - loss: 1.3695
Epoch 5/20 75072/75072 [==============================] - 29s 383us/step - loss: 1.3181
Epoch 6/20 75072/75072 [==============================] - 29s 385us/step - loss: 1.2797
Epoch 7/20 75072/75072 [==============================] - 29s 382us/step - loss: 1.2500
Epoch 8/20 75072/75072 [==============================] - 29s 388us/step - loss: 1.2254
Epoch 9/20 75072/75072 [==============================] - 29s 380us/step - loss: 1.2052
Epoch 10/20 75072/75072 [==============================] - 28s 377us/step - loss: 1.1886
Epoch 11/20 75072/75072 [==============================] - 29s 385us/step - loss: 1.1754
Epoch 12/20 75072/75072 [==============================] - 28s 379us/step - loss: 1.1649
Epoch 13/20 75072/75072 [==============================] - 29s 385us/step - loss: 1.1563
Epoch 14/20 75072/75072 [==============================] - 29s 390us/step - loss: 1.1499
Epoch 15/20 75072/75072 [==============================] - 29s 383us/step - loss: 1.1447
Epoch 16/20 75072/75072 [==============================] - 28s 377us/step - loss: 1.1404
Epoch 17/20 75072/75072 [==============================] - 29s 384us/step - loss: 1.1371
Epoch 18/20 75072/75072 [==============================] - 29s 383us/step - loss: 1.1334
Epoch 19/20 75072/75072 [==============================] - 28s 379us/step - loss: 1.1328
Epoch 20/20 75072/75072 [==============================] - 28s 378us/step - loss: 1.1314

```
K.set_value(model.optimizer.lr, 0.0001)
model.fit(x=X, y=y_cat, batch_size=bs, epochs=20, verbose=1, callbacks=[reset_state], shuffle=False)
```

Epoch 1/20 75072/75072 [==============================] - 29s 382us/step - loss: 1.1015
Epoch 2/20 75072/75072 [==============================] - 32s 428us/step - loss: 1.0755
Epoch 3/20 75072/75072 [==============================] - 33s 442us/step - loss: 1.0633
Epoch 4/20 75072/75072 [==============================] - 30s 406us/step - loss: 1.0552
Epoch 5/20 75072/75072 [==============================] - 29s 381us/step - loss: 1.0489
Epoch 6/20 75072/75072 [==============================] - 28s 372us/step - loss: 1.0434
Epoch 7/20 75072/75072 [==============================] - 28s 372us/step - loss: 1.0392
Epoch 8/20 75072/75072 [==============================] - 29s 381us/step - loss: 1.0354
Epoch 9/20 75072/75072 [==============================] - 28s 376us/step - loss: 1.0323
Epoch 10/20 75072/75072 [==============================] - 28s 379us/step - loss: 1.0293
Epoch 11/20 75072/75072 [==============================] - 28s 379us/step - loss: 1.0264
Epoch 12/20 75072/75072 [==============================] - 28s 373us/step - loss: 1.0246
Epoch 13/20 75072/75072 [==============================] - 28s 376us/step - loss: 1.0224
Epoch 14/20 75072/75072 [==============================] - 28s 373us/step - loss: 1.0203
Epoch 15/20 75072/75072 [==============================] - 29s 382us/step - loss: 1.0183
Epoch 16/20 75072/75072 [==============================] - 28s 376us/step - loss: 1.0162
Epoch 17/20 75072/75072 [==============================] - 28s 376us/step - loss: 1.0150
Epoch 18/20 75072/75072 [==============================] - 28s 376us/step - loss: 1.0134
Epoch 19/20 75072/75072 [==============================] - 28s 377us/step - loss: 1.0125
Epoch 20/20 75072/75072 [==============================] - 28s 378us/step - loss: 1.0108

And now the moment of truth!

```
from pprint import pprint

pprint(get_next_n_chars(model, bs, 'for thos', 400))
```

('for those whoever be no longer for their shows that is the basic of the '
 'conseque of the conseque once more proves and the same of the consequent, '
 'and at the other that is the basic of the conseque perfeaced itself to the '
 'sense and self-conseque contemptations of the conseque once still that the '
 'great people take a soul as a profoundination and an artistic as something '
 'might be the most problem and self-co')

Ha, well I wouldn't quite call it sensible but it's not super-terrible either. It's forming mostly complete words, occasionally using punctuation, etc. Not bad for being trained one character at a time. There are many ways that this can be improved of course, but hopefully this has illustrated the key concepts behind building a sequence model.