Deep Learning With Keras: Convolutional Networks
8th January 2019

In my last post, I kicked off a series on deep learning by showing how to apply several core neural network concepts such as dense layers, embeddings, and regularization to build models using structured and/or time-series data. In this post we'll see how to build models using another core component of modern deep learning: convolutions. Convolutional layers are primarily used in image-based models, but they have some interesting properties that make them useful for sequential data as well. The biggest wrinkle that convolutional layers introduce is an element of locality: they have a receptive field that covers only a subset of the input data. In essence, each convolution can only "see" part of the image, sequence, etc. that it's being trained on.
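To make the locality idea concrete, here's a minimal sketch (plain numpy, separate from the Keras models we'll build below) of a single 3 x 3 filter sliding over a tiny 5 x 5 "image". Each output value is computed from just one 3 x 3 patch of the input:

import numpy as np

image = np.arange(25, dtype=np.float32).reshape(5, 5)  # toy 5 x 5 single-channel "image"
kernel = np.ones((3, 3), dtype=np.float32) / 9.0       # 3 x 3 averaging filter

# With no padding and a stride of 1, a 3 x 3 kernel over a 5 x 5 input yields a 3 x 3 output
output = np.zeros((3, 3), dtype=np.float32)
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]   # the only pixels this output location ever "sees"
        output[i, j] = np.sum(patch * kernel)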
I'm not going to cover convolutional layers in depth here; there are already tons of great resources out there for learning about them. If you're new to the concept, I would recommend this blog series as a starting point, or just do some googling for explainers. There's a lot of good content that comes up.
We'll start with a simple dense network and gradually improve it until we're getting pretty good results classifying images in the CIFAR 10 data set. We'll then see how we can avoid building a network from scratch by taking a large, pre-trained net and fine-tuning it to a custom domain. As with my first post in this series, much of this content is originally based on Jeremy Howard's fast.ai lessons. I've combined content from a few different lessons and converted code to use Keras instead of PyTorch.
Since Keras comes with a pre-built data loader for CIFAR 10, we can just use that to get started instead of worrying about locating and importing the data.
%matplotlib inline
import matplotlib.pyplot as plt
from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train.shape, y_train.shape, x_test.shape, y_test.shape
((50000, 32, 32, 3), (50000, 1), (10000, 32, 32, 3), (10000, 1))
Plot a few of the images to get an idea of what they look like and confirm that the data loaded correctly. You'll quickly notice that the CIFAR 10 images are very low resolution (32 x 32 pixels with 3 color channels). This makes training from scratch quite feasible even on modest compute resources.
def plot_image(index):
    image = x_train[index, :, :, :]
    plt.imshow(image)
plot_image(4)
plot_image(6)
We need to do a data conversion to get the class labels into one-hot encoded format. This will allow us to use a softmax activation and categorical cross-entropy loss in our network. CIFAR 10 only has 10 distinct classes, so this is fairly straightforward.
import keras
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
y_train[0]
array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.], dtype=float32)
The only other pre-processing step to apply is normalizing the input data. Since everything is an RGB value, we can keep it simple and just divide by 255.
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
Define a few useful configuration items to use throughout the exercise. The input shape variable will have a value of (32, 32, 3) corresponding to the shape of the array for each image.
in_shape = x_train.shape[1:]
batch_size = 256
n_classes = 10
lr = 0.01
Now we can get started with the actual modeling part. For a first attempt, let's do the simplest and most naive model possible. We'll just create a straightforward fully-connected model and stick a softmax activation on at the end.
from keras.models import Model
from keras.layers import Activation, Dense, Flatten, Input
from keras.optimizers import Adam
def SimpleNet(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    x = Flatten()(i)
    for n in range(len(layers)):
        x = Dense(layers[n])(x)
        x = Activation('relu')(x)
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)

    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

    return model
Note that the architecture is somewhat flexible in that we can define as many dense layers as we want by just passing in a list of numbers to the "layers" parameter (where the numbers correspond to the size of the layer). In this case we're only going to use one layer, but this capability will be very useful later on.
model = SimpleNet(in_shape, [40], n_classes, lr)
model.summary()
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 32, 32, 3)         0
flatten_1 (Flatten)          (None, 3072)              0
dense_1 (Dense)              (None, 40)                122920
activation_1 (Activation)    (None, 40)                0
dense_2 (Dense)              (None, 10)                410
activation_2 (Activation)    (None, 10)                0
=================================================================
Total params: 123,330
Trainable params: 123,330
Non-trainable params: 0
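As a quick sanity check on those numbers: the flatten layer turns each 32 x 32 x 3 image into a 3,072-element vector, so the 40-unit dense layer needs 3,072 weights per unit plus one bias each, which matches the summary.

3072 * 40 + 40
122920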
Our last step before training is to define an image data generator. We could just train on the images as-is, but randomly applying transformations to the images will make the classifier more robust. Keras has a utility class built in for just this purpose, so we can use it to randomly shift the images and flip them horizontally during training.
from keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)
Let's try training for 10 epochs and see what happens!
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
epochs=10, validation_data=(x_test, y_test), workers=4)
Epoch 1/10 196/196 [==============================] - 26s 134ms/step - loss: 2.4531 - acc: 0.0974 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 2/10 196/196 [==============================] - 23s 120ms/step - loss: 2.2576 - acc: 0.1268 - val_loss: 2.1575 - val_acc: 0.1623
Epoch 3/10 196/196 [==============================] - 24s 120ms/step - loss: 2.1256 - acc: 0.1741 - val_loss: 2.0836 - val_acc: 0.1764
Epoch 4/10 196/196 [==============================] - 24s 122ms/step - loss: 2.1123 - acc: 0.1760 - val_loss: 2.0775 - val_acc: 0.1972
Epoch 5/10 196/196 [==============================] - 23s 119ms/step - loss: 2.0938 - acc: 0.1802 - val_loss: 2.0716 - val_acc: 0.1710
Epoch 6/10 196/196 [==============================] - 24s 120ms/step - loss: 2.0940 - acc: 0.1784 - val_loss: 2.0660 - val_acc: 0.1875
Epoch 7/10 196/196 [==============================] - 24s 121ms/step - loss: 2.0894 - acc: 0.1822 - val_loss: 2.1032 - val_acc: 0.1765
Epoch 8/10 196/196 [==============================] - 24s 121ms/step - loss: 2.0954 - acc: 0.1799 - val_loss: 2.0751 - val_acc: 0.1745
Epoch 9/10 196/196 [==============================] - 23s 120ms/step - loss: 2.0853 - acc: 0.1788 - val_loss: 2.0702 - val_acc: 0.1743
Epoch 10/10 196/196 [==============================] - 23s 120ms/step - loss: 2.0889 - acc: 0.1775 - val_loss: 2.0659 - val_acc: 0.1844
Clearly the naive approach is not very effective; the model is doing only a bit better than random guessing. Let's replace the dense layer with a few convolutional layers instead. For our first attempt at using convolutions, we'll use a kernel size of 3 and a stride of 2 to downsample (rather than using pooling layers between the conv layers), plus a global max pooling layer to condense the output shape before going through the softmax.
from keras.layers import Conv2D, GlobalMaxPooling2D
def ConvNet(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    for n in range(len(layers)):
        if n == 0:
            x = Conv2D(layers[n], kernel_size=3, strides=2)(i)
        else:
            x = Conv2D(layers[n], kernel_size=3, strides=2)(x)
        x = Activation('relu')(x)
    x = GlobalMaxPooling2D()(x)
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)

    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

    return model
This time let's try using 3 conv layers with an increasing number of filters in each layer.
model = ConvNet(in_shape, [20, 40, 80], n_classes, lr)
model.summary()
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 32, 32, 3)         0
conv2d_1 (Conv2D)            (None, 15, 15, 20)        560
activation_3 (Activation)    (None, 15, 15, 20)        0
conv2d_2 (Conv2D)            (None, 7, 7, 40)          7240
activation_4 (Activation)    (None, 7, 7, 40)          0
conv2d_3 (Conv2D)            (None, 3, 3, 80)          28880
activation_5 (Activation)    (None, 3, 3, 80)          0
global_max_pooling2d_1 (Glob (None, 80)                0
dense_3 (Dense)              (None, 10)                810
activation_6 (Activation)    (None, 10)                0
=================================================================
Total params: 37,490
Trainable params: 37,490
Non-trainable params: 0
It's worth checking your intuition and understanding of what's going on by looking at the summary output and verifying that the numbers make sense. For instance, why does the first convolutional layer have 560 parameters? Where does that come from? Well, we have a kernel size of 3, which creates a 3 x 3 filter (i.e. 9 parameters), but the input also has 3 color channels, so each filter is really 3 x 3 x 3 = 27 parameters, plus 1 for the bias, or 28 per filter. We specified 20 filters in the first layer, so 28 x 20 = 560. Try applying similar logic to the second conv layer and see if the result makes sense.
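If you work through the same logic for the second conv layer: each of its 40 filters spans a 3 x 3 window across the 20 channels produced by the first layer, plus a bias, which matches the 7,240 parameters in the summary.

3 * 3 * 20 * 40 + 40
7240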
Now that we've got a model, let's try training it using the exact same approach as before.
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
epochs=10, validation_data=(x_test, y_test), workers=4)
Epoch 1/10 196/196 [==============================] - 25s 127ms/step - loss: 1.8725 - acc: 0.3019 - val_loss: 1.7737 - val_acc: 0.3772
Epoch 2/10 196/196 [==============================] - 24s 120ms/step - loss: 1.6342 - acc: 0.4015 - val_loss: 1.5930 - val_acc: 0.4314
Epoch 3/10 196/196 [==============================] - 24s 120ms/step - loss: 1.5503 - acc: 0.4349 - val_loss: 1.5013 - val_acc: 0.4567
Epoch 4/10 196/196 [==============================] - 24s 122ms/step - loss: 1.4848 - acc: 0.4623 - val_loss: 1.4356 - val_acc: 0.4801
Epoch 5/10 196/196 [==============================] - 24s 122ms/step - loss: 1.4493 - acc: 0.4798 - val_loss: 1.3845 - val_acc: 0.4972
Epoch 6/10 196/196 [==============================] - 23s 119ms/step - loss: 1.4186 - acc: 0.4892 - val_loss: 1.3761 - val_acc: 0.5066
Epoch 7/10 196/196 [==============================] - 24s 121ms/step - loss: 1.3999 - acc: 0.4956 - val_loss: 1.3681 - val_acc: 0.5024
Epoch 8/10 196/196 [==============================] - 24s 121ms/step - loss: 1.3837 - acc: 0.5047 - val_loss: 1.4632 - val_acc: 0.4810
Epoch 9/10 196/196 [==============================] - 23s 120ms/step - loss: 1.3838 - acc: 0.5006 - val_loss: 1.3647 - val_acc: 0.5139
Epoch 10/10 196/196 [==============================] - 24s 120ms/step - loss: 1.3565 - acc: 0.5114 - val_loss: 1.3553 - val_acc: 0.5162
The results are a lot different this time! The model is clearly learning, and after 10 epochs we're at about 50% accuracy on the validation set. Still, we should be able to do a lot better. For the next attempt let's introduce a few new wrinkles. First, we're going to add batch normalization after each conv layer. Second, we're going to add a single conv layer at the beginning with a larger kernel size and a stride of 1, so we don't immediately downsample the image. Third, we're going to introduce padding, which changes the output shape of each conv layer. Finally, we're going to add a few more layers to make the model bigger.
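For reference on the padding change, here's the shape arithmetic for a kernel-3, stride-2 conv layer on a 32-pixel input; you can check it against the two model summaries.

import math
# 'valid' padding (the Keras default, used in the previous model):
print(math.floor((32 - 3) / 2) + 1)   # 15 -> matches the 15 x 15 output above
# 'same' padding (used in the next model):
print(math.ceil(32 / 2))              # 16 -> matches the 16 x 16 output below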
To make the model definition more modular, I've pulled out the conv layer into a separate class. There are multiple ways to do this (a function would have worked just as well) but I opted to mimic the way Keras's functional API works.
from keras.layers import BatchNormalization
class ConvLayer:
    def __init__(self, filters, kernel_size, stride):
        self.filters = filters
        self.kernel_size = kernel_size
        self.stride = stride

    def __call__(self, x):
        x = Conv2D(self.filters, kernel_size=self.kernel_size,
                   strides=self.stride, padding='same', use_bias=False)(x)
        x = Activation('relu')(x)
        x = BatchNormalization()(x)
        return x

def ConvNet2(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    x = Conv2D(layers[0], kernel_size=5, strides=1, padding='same')(i)
    x = Activation('relu')(x)
    for n in range(1, len(layers)):
        x = ConvLayer(layers[n], kernel_size=3, stride=2)(x)
    x = GlobalMaxPooling2D()(x)
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)

    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

    return model
model = ConvNet2(in_shape, [10, 20, 40, 80, 160], n_classes, lr)
model.summary()
Layer (type)                 Output Shape              Param #
=================================================================
input_3 (InputLayer)         (None, 32, 32, 3)         0
conv2d_4 (Conv2D)            (None, 32, 32, 10)        760
activation_7 (Activation)    (None, 32, 32, 10)        0
conv2d_5 (Conv2D)            (None, 16, 16, 20)        1800
activation_8 (Activation)    (None, 16, 16, 20)        0
batch_normalization_1 (Batch (None, 16, 16, 20)        80
conv2d_6 (Conv2D)            (None, 8, 8, 40)          7200
activation_9 (Activation)    (None, 8, 8, 40)          0
batch_normalization_2 (Batch (None, 8, 8, 40)          160
conv2d_7 (Conv2D)            (None, 4, 4, 80)          28800
activation_10 (Activation)   (None, 4, 4, 80)          0
batch_normalization_3 (Batch (None, 4, 4, 80)          320
conv2d_8 (Conv2D)            (None, 2, 2, 160)         115200
activation_11 (Activation)   (None, 2, 2, 160)         0
batch_normalization_4 (Batch (None, 2, 2, 160)         640
global_max_pooling2d_2 (Glob (None, 160)               0
dense_4 (Dense)              (None, 10)                1610
activation_12 (Activation)   (None, 10)                0
=================================================================
Total params: 156,570
Trainable params: 155,970
Non-trainable params: 600
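One quirk worth noting in this summary is the 600 non-trainable parameters. Each batch norm layer keeps four values per channel: a learned scale and shift (trained by backprop) plus a moving mean and variance (updated during the forward pass rather than by the optimizer). The moving statistics across the four batch norm layers account for the non-trainable count.

2 * (20 + 40 + 80 + 160)
600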
We made a bunch of improvements and the network has a much larger capacity, so let's see what it does.
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
epochs=10, validation_data=(x_test, y_test), workers=4)
Epoch 1/10 196/196 [==============================] - 24s 125ms/step - loss: 1.6451 - acc: 0.4258 - val_loss: 1.5408 - val_acc: 0.4597
Epoch 2/10 196/196 [==============================] - 23s 118ms/step - loss: 1.3130 - acc: 0.5280 - val_loss: 1.7158 - val_acc: 0.4559
Epoch 3/10 196/196 [==============================] - 24s 121ms/step - loss: 1.1669 - acc: 0.5803 - val_loss: 1.5101 - val_acc: 0.5311
Epoch 4/10 196/196 [==============================] - 23s 119ms/step - loss: 1.0642 - acc: 0.6205 - val_loss: 1.3304 - val_acc: 0.5538
Epoch 5/10 196/196 [==============================] - 23s 118ms/step - loss: 0.9887 - acc: 0.6485 - val_loss: 1.2749 - val_acc: 0.5955
Epoch 6/10 196/196 [==============================] - 23s 119ms/step - loss: 0.9264 - acc: 0.6717 - val_loss: 1.3210 - val_acc: 0.5819
Epoch 7/10 196/196 [==============================] - 23s 120ms/step - loss: 0.8812 - acc: 0.6887 - val_loss: 0.9221 - val_acc: 0.6807
Epoch 8/10 196/196 [==============================] - 23s 120ms/step - loss: 0.8437 - acc: 0.6985 - val_loss: 0.8809 - val_acc: 0.7012
Epoch 9/10 196/196 [==============================] - 24s 120ms/step - loss: 0.8196 - acc: 0.7083 - val_loss: 0.9064 - val_acc: 0.6873
Epoch 10/10 196/196 [==============================] - 24s 120ms/step - loss: 0.7897 - acc: 0.7194 - val_loss: 0.8259 - val_acc: 0.7179
That's a significant improvement! Our validation accuracy after 10 epochs jumped all the way from ~50% to ~70%. We're already doing pretty well, but there's one more major addition we can make that should bump performance even higher. A key development in modern convolutional networks was the introduction of residual layers, which add an "identity" connection to the output of a block of convolutions. Below I've added a new "ResLayer" class that inherits from "ConvLayer" but outputs the sum of the original input and the conv layer's output. Building on the previous network, we've now added two residual layers to each "block" in the model definition. These residual layers have a stride of 1 so they don't change the shape of the output. Finally, we've added a bit of regularization to keep the model from overfitting too badly.
from keras import layers
from keras import regularizers
from keras.layers import Dropout
class ConvLayer:
    def __init__(self, filters, kernel_size, stride):
        self.filters = filters
        self.kernel_size = kernel_size
        self.stride = stride

    def __call__(self, x):
        x = Conv2D(self.filters, kernel_size=self.kernel_size,
                   strides=self.stride, padding='same', use_bias=False,
                   kernel_regularizer=regularizers.l2(1e-6))(x)
        x = Activation('relu')(x)
        x = BatchNormalization()(x)
        return x

class ResLayer(ConvLayer):
    def __call__(self, x):
        return layers.add([x, super().__call__(x)])

def ResNet(in_shape, layers, n_classes, lr):
    i = Input(shape=in_shape)
    x = Conv2D(layers[0], kernel_size=7, strides=1, padding='same')(i)
    x = Activation('relu')(x)
    for n in range(1, len(layers)):
        x = ConvLayer(layers[n], kernel_size=3, stride=2)(x)
        x = ResLayer(layers[n], kernel_size=3, stride=1)(x)
        x = ResLayer(layers[n], kernel_size=3, stride=1)(x)
    x = GlobalMaxPooling2D()(x)
    x = Dropout(0.1)(x)
    x = Dense(n_classes)(x)
    x = Activation('softmax')(x)

    model = Model(inputs=i, outputs=x)
    opt = Adam(lr=lr)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

    return model
model = ResNet(in_shape, [10, 20, 40, 80, 160], n_classes, lr)
model.summary()
Layer (type)                    Output Shape         Param #   Connected to
==================================================================================================
input_4 (InputLayer)            (None, 32, 32, 3)    0
conv2d_9 (Conv2D)               (None, 32, 32, 10)   1480      input_4[0][0]
activation_13 (Activation)      (None, 32, 32, 10)   0         conv2d_9[0][0]
conv2d_10 (Conv2D)              (None, 16, 16, 20)   1800      activation_13[0][0]
activation_14 (Activation)      (None, 16, 16, 20)   0         conv2d_10[0][0]
batch_normalization_5 (BatchNor (None, 16, 16, 20)   80        activation_14[0][0]
conv2d_11 (Conv2D)              (None, 16, 16, 20)   3600      batch_normalization_5[0][0]
activation_15 (Activation)      (None, 16, 16, 20)   0         conv2d_11[0][0]
batch_normalization_6 (BatchNor (None, 16, 16, 20)   80        activation_15[0][0]
add_1 (Add)                     (None, 16, 16, 20)   0         batch_normalization_5[0][0]
                                                               batch_normalization_6[0][0]
conv2d_12 (Conv2D)              (None, 16, 16, 20)   3600      add_1[0][0]
activation_16 (Activation)      (None, 16, 16, 20)   0         conv2d_12[0][0]
batch_normalization_7 (BatchNor (None, 16, 16, 20)   80        activation_16[0][0]
add_2 (Add)                     (None, 16, 16, 20)   0         add_1[0][0]
                                                               batch_normalization_7[0][0]
conv2d_13 (Conv2D)              (None, 8, 8, 40)     7200      add_2[0][0]
activation_17 (Activation)      (None, 8, 8, 40)     0         conv2d_13[0][0]
batch_normalization_8 (BatchNor (None, 8, 8, 40)     160       activation_17[0][0]
conv2d_14 (Conv2D)              (None, 8, 8, 40)     14400     batch_normalization_8[0][0]
activation_18 (Activation)      (None, 8, 8, 40)     0         conv2d_14[0][0]
batch_normalization_9 (BatchNor (None, 8, 8, 40)     160       activation_18[0][0]
add_3 (Add)                     (None, 8, 8, 40)     0         batch_normalization_8[0][0]
                                                               batch_normalization_9[0][0]
conv2d_15 (Conv2D)              (None, 8, 8, 40)     14400     add_3[0][0]
activation_19 (Activation)      (None, 8, 8, 40)     0         conv2d_15[0][0]
batch_normalization_10 (BatchNo (None, 8, 8, 40)     160       activation_19[0][0]
add_4 (Add)                     (None, 8, 8, 40)     0         add_3[0][0]
                                                               batch_normalization_10[0][0]
conv2d_16 (Conv2D)              (None, 4, 4, 80)     28800     add_4[0][0]
activation_20 (Activation)      (None, 4, 4, 80)     0         conv2d_16[0][0]
batch_normalization_11 (BatchNo (None, 4, 4, 80)     320       activation_20[0][0]
conv2d_17 (Conv2D)              (None, 4, 4, 80)     57600     batch_normalization_11[0][0]
activation_21 (Activation)      (None, 4, 4, 80)     0         conv2d_17[0][0]
batch_normalization_12 (BatchNo (None, 4, 4, 80)     320       activation_21[0][0]
add_5 (Add)                     (None, 4, 4, 80)     0         batch_normalization_11[0][0]
                                                               batch_normalization_12[0][0]
conv2d_18 (Conv2D)              (None, 4, 4, 80)     57600     add_5[0][0]
activation_22 (Activation)      (None, 4, 4, 80)     0         conv2d_18[0][0]
batch_normalization_13 (BatchNo (None, 4, 4, 80)     320       activation_22[0][0]
add_6 (Add)                     (None, 4, 4, 80)     0         add_5[0][0]
                                                               batch_normalization_13[0][0]
conv2d_19 (Conv2D)              (None, 2, 2, 160)    115200    add_6[0][0]
activation_23 (Activation)      (None, 2, 2, 160)    0         conv2d_19[0][0]
batch_normalization_14 (BatchNo (None, 2, 2, 160)    640       activation_23[0][0]
conv2d_20 (Conv2D)              (None, 2, 2, 160)    230400    batch_normalization_14[0][0]
activation_24 (Activation)      (None, 2, 2, 160)    0         conv2d_20[0][0]
batch_normalization_15 (BatchNo (None, 2, 2, 160)    640       activation_24[0][0]
add_7 (Add)                     (None, 2, 2, 160)    0         batch_normalization_14[0][0]
                                                               batch_normalization_15[0][0]
conv2d_21 (Conv2D)              (None, 2, 2, 160)    230400    add_7[0][0]
activation_25 (Activation)      (None, 2, 2, 160)    0         conv2d_21[0][0]
batch_normalization_16 (BatchNo (None, 2, 2, 160)    640       activation_25[0][0]
add_8 (Add)                     (None, 2, 2, 160)    0         add_7[0][0]
                                                               batch_normalization_16[0][0]
global_max_pooling2d_3 (GlobalM (None, 160)          0         add_8[0][0]
dropout_1 (Dropout)             (None, 160)          0         global_max_pooling2d_3[0][0]
dense_5 (Dense)                 (None, 10)           1610      dropout_1[0][0]
activation_26 (Activation)      (None, 10)           0         dense_5[0][0]
==================================================================================================
Total params: 771,690
Trainable params: 769,890
Non-trainable params: 1,800
The model summary is now getting quite large, but you can still follow through each layer and make sense of what's happening. Let's run this one last time and see what the results look like. We'll increase the epoch count since deeper networks tend to take longer to train.
model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
epochs=40, validation_data=(x_test, y_test), workers=4)
Epoch 1/40 196/196 [==============================] - 28s 145ms/step - loss: 1.9806 - acc: 0.3498 - val_loss: 7.4266 - val_acc: 0.0771
Epoch 2/40 196/196 [==============================] - 23s 118ms/step - loss: 1.5761 - acc: 0.4484 - val_loss: 2.0037 - val_acc: 0.3478
Epoch 3/40 196/196 [==============================] - 24s 124ms/step - loss: 1.5488 - acc: 0.4612 - val_loss: 14.3443 - val_acc: 0.1005
Epoch 4/40 196/196 [==============================] - 24s 122ms/step - loss: 1.6194 - acc: 0.4359 - val_loss: 2.5182 - val_acc: 0.2401
Epoch 5/40 196/196 [==============================] - 24s 121ms/step - loss: 1.5562 - acc: 0.4626 - val_loss: 2.0495 - val_acc: 0.3302
Epoch 6/40 196/196 [==============================] - 24s 121ms/step - loss: 1.6183 - acc: 0.4400 - val_loss: 2.9989 - val_acc: 0.1782
Epoch 7/40 196/196 [==============================] - 24s 121ms/step - loss: 1.4886 - acc: 0.4672 - val_loss: 1.3995 - val_acc: 0.4944
Epoch 8/40 196/196 [==============================] - 24s 121ms/step - loss: 1.3551 - acc: 0.5162 - val_loss: 1.3086 - val_acc: 0.5268
Epoch 9/40 196/196 [==============================] - 24s 123ms/step - loss: 1.2971 - acc: 0.5373 - val_loss: 1.2979 - val_acc: 0.5423
Epoch 10/40 196/196 [==============================] - 24s 121ms/step - loss: 1.2737 - acc: 0.5507 - val_loss: 8.2801 - val_acc: 0.1325
Epoch 11/40 196/196 [==============================] - 24s 123ms/step - loss: 1.3697 - acc: 0.5350 - val_loss: 1.2361 - val_acc: 0.5742
Epoch 12/40 196/196 [==============================] - 24s 121ms/step - loss: 1.2410 - acc: 0.5652 - val_loss: 1.1365 - val_acc: 0.6007
Epoch 13/40 196/196 [==============================] - 24s 121ms/step - loss: 1.1514 - acc: 0.5958 - val_loss: 1.1343 - val_acc: 0.6118
Epoch 14/40 196/196 [==============================] - 24s 122ms/step - loss: 1.1079 - acc: 0.6096 - val_loss: 1.1276 - val_acc: 0.6092
Epoch 15/40 196/196 [==============================] - 24s 121ms/step - loss: 1.0586 - acc: 0.6306 - val_loss: 1.0696 - val_acc: 0.6330
Epoch 16/40 196/196 [==============================] - 23s 119ms/step - loss: 1.0240 - acc: 0.6437 - val_loss: 1.0270 - val_acc: 0.6596
Epoch 17/40 196/196 [==============================] - 24s 122ms/step - loss: 0.9809 - acc: 0.6611 - val_loss: 1.0828 - val_acc: 0.6391
Epoch 18/40 196/196 [==============================] - 24s 121ms/step - loss: 0.9591 - acc: 0.6685 - val_loss: 0.9332 - val_acc: 0.6848
Epoch 19/40 196/196 [==============================] - 24s 122ms/step - loss: 0.9166 - acc: 0.6860 - val_loss: 0.9894 - val_acc: 0.6632
Epoch 20/40 196/196 [==============================] - 24s 121ms/step - loss: 0.8854 - acc: 0.6983 - val_loss: 1.1848 - val_acc: 0.6169
Epoch 21/40 196/196 [==============================] - 24s 122ms/step - loss: 0.8659 - acc: 0.7045 - val_loss: 0.9105 - val_acc: 0.6978
Epoch 22/40 196/196 [==============================] - 24s 122ms/step - loss: 0.8366 - acc: 0.7162 - val_loss: 0.8779 - val_acc: 0.7132
Epoch 23/40 196/196 [==============================] - 23s 120ms/step - loss: 0.8175 - acc: 0.7252 - val_loss: 1.8874 - val_acc: 0.5708
Epoch 24/40 196/196 [==============================] - 24s 120ms/step - loss: 0.8383 - acc: 0.7203 - val_loss: 0.9611 - val_acc: 0.6878
Epoch 25/40 196/196 [==============================] - 24s 121ms/step - loss: 0.7910 - acc: 0.7360 - val_loss: 0.8956 - val_acc: 0.7037
Epoch 26/40 196/196 [==============================] - 24s 121ms/step - loss: 0.7728 - acc: 0.7445 - val_loss: 0.8712 - val_acc: 0.7297
Epoch 27/40 196/196 [==============================] - 24s 121ms/step - loss: 0.7532 - acc: 0.7514 - val_loss: 0.8697 - val_acc: 0.7191
Epoch 28/40 196/196 [==============================] - 24s 121ms/step - loss: 0.7419 - acc: 0.7568 - val_loss: 0.7995 - val_acc: 0.7405
Epoch 29/40 196/196 [==============================] - 24s 122ms/step - loss: 0.7385 - acc: 0.7599 - val_loss: 0.8080 - val_acc: 0.7451
Epoch 30/40 196/196 [==============================] - 24s 121ms/step - loss: 0.7202 - acc: 0.7663 - val_loss: 0.9121 - val_acc: 0.7253
Epoch 31/40 196/196 [==============================] - 24s 121ms/step - loss: 0.7078 - acc: 0.7737 - val_loss: 0.8999 - val_acc: 0.7223
Epoch 32/40 196/196 [==============================] - 24s 120ms/step - loss: 0.6969 - acc: 0.7756 - val_loss: 0.9682 - val_acc: 0.7135
Epoch 33/40 196/196 [==============================] - 24s 121ms/step - loss: 0.6851 - acc: 0.7825 - val_loss: 0.8145 - val_acc: 0.7456
Epoch 34/40 196/196 [==============================] - 23s 119ms/step - loss: 0.6800 - acc: 0.7859 - val_loss: 0.7972 - val_acc: 0.7585
Epoch 35/40 196/196 [==============================] - 23s 118ms/step - loss: 0.6689 - acc: 0.7919 - val_loss: 0.7807 - val_acc: 0.7654
Epoch 36/40 196/196 [==============================] - 24s 122ms/step - loss: 0.6626 - acc: 0.7949 - val_loss: 0.8022 - val_acc: 0.7509
Epoch 37/40 196/196 [==============================] - 23s 119ms/step - loss: 0.6550 - acc: 0.7987 - val_loss: 0.8129 - val_acc: 0.7613
Epoch 38/40 196/196 [==============================] - 24s 122ms/step - loss: 0.6532 - acc: 0.8006 - val_loss: 0.8861 - val_acc: 0.7359
Epoch 39/40 196/196 [==============================] - 23s 119ms/step - loss: 0.6419 - acc: 0.8043 - val_loss: 0.8233 - val_acc: 0.7568
Epoch 40/40 196/196 [==============================] - 24s 124ms/step - loss: 0.6308 - acc: 0.8109 - val_loss: 0.7809 - val_acc: 0.7670
The results look pretty good. We're starting to hit the point where accuracy improvements are getting harder to come by. It's definitely possible to keep improving the model with the right tuning and augmentation strategies; however, diminishing returns kick in relative to the effort involved. Also, as the network keeps getting bigger (and as we graduate to larger and more complex data sets), it becomes much, much harder to build a network from scratch.
Fortunately there's an alternative solution via transfer learning, which takes a model trained on one task and adapts it to another task. Combined with pre-training (the practice of using a model that's already been trained for a given task), we can take very large networks developed by the likes of Google and Facebook and fine-tune them to work in a custom domain of our choosing. Below I'll walk through an example of how this works by using a pre-trained ImageNet model and adapting it to Kaggle's dogs vs. cats data set.
First get some imports out of the way. We'll need all of this stuff throughout the exercise.
import numpy as np
from keras.applications import ResNet50
from keras.applications.resnet50 import preprocess_input
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import RMSprop
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
The easiest way to get the data set is via fast.ai's servers, where they've graciously hosted a single zip file with everything we need. Extract this to a directory somewhere on your machine and update the "PATH" variable below, and you should be good to go. We can also specify a few useful constants such as the image dimension and batch size.
PATH = '/home/paperspace/data/dogscats/'
train_dir = f'{PATH}train'
valid_dir = f'{PATH}valid'
size = 224
batch_size = 64
Next we need a generator to apply transformations to the images. As before, we can use the generator Keras has built in. The only wrinkle is using a specialized preprocessing function designed for ImageNet-like source data (this also comes with Keras and was imported above).
train_datagen = ImageDataGenerator(
shear_range=0.2,
zoom_range=0.2,
preprocessing_function=preprocess_input,
horizontal_flip=True)
val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
With CIFAR 10 we had the whole data set loaded into memory, but that strategy usually isn't feasible for larger image databases. In this case we have a bunch of image files in folders on disk as our starting point, and to run a model over these images we want to be able to stream images into memory in batches rather than load everything at once. Fortunately Keras can also handle this scenario natively using the "flow_from_directory" function. We just need to specify the directory, image size, and batch size.
train_generator = train_datagen.flow_from_directory(train_dir,
target_size=(size, size),
batch_size=batch_size, class_mode='binary')
val_generator = val_datagen.flow_from_directory(valid_dir,
shuffle=False,
target_size=(size, size),
batch_size=batch_size, class_mode='binary')
Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
For the model, we'll use the ResNet-50 architecture with pre-trained weights. ResNet-50 is a 50-layer residual network that achieved roughly 92% top-5 accuracy on ImageNet classification (Keras exposes it as a much longer list of layer objects, which is what we'll index into later when fine-tuning). Keras provides both the model architecture and an option to use existing weights out of the box. The other notable parameter in the model initializer is "include_top", which indicates whether we want to include the fully-connected layer at the top of the network. In our case the answer is no, because we want to "hook into" the model after the last residual block and add our own architecture on top.
base_model = ResNet50(weights='imagenet', include_top=False)
x = base_model.output
After instantiating the pre-trained ResNet-50 model, we can start adding new layers to the architecture. Let's start with a global average pooling layer to collapse the spatial dimensions into a flat feature vector, then add a fully-connected layer of our own. Finally, we'll use a single sigmoid unit for the class probability since the task is binary (cat or dog).
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
preds = Dense(1, activation='sigmoid')(x)
Before finishing the model definition and compiling, there's one more notable step. We need to prevent the "base" layers of the model from participating in the weight update phase of training while we "break in" the new layers we just added. Since each layer in a Keras model has a "trainable" property, we can just set it to false for all layers in the base architecture.
(Aside: There is apparently some funkiness to using this approach in models that have batch norm layers, which can lead to sub-optimal results, especially when fine-tuning as we'll do in a few steps. I haven't seen a conclusive answer on how to deal with this, and the naive approach seems to work okay for this problem, so I'm not doing anything special to deal with it here, but I wanted to point it out as a potential issue one might run into. There's a lengthy discussion on the subject here.)
model = Model(inputs=base_model.input, outputs=preds)
for layer in base_model.layers: layer.trainable = False
model.compile(optimizer=RMSprop(lr=0.001), loss='binary_crossentropy', metrics=['accuracy'])
Training should look pretty familiar; the only wrinkle is that we need to specify the number of steps (batches) per epoch when using the "flow_from_directory" generator.
history = model.fit_generator(train_generator,
train_generator.n // batch_size, epochs=3, workers=4,
validation_data=val_generator,
validation_steps=val_generator.n // batch_size)
Epoch 1/3 359/359 [==============================] - 128s 357ms/step - loss: 0.1738 - acc: 0.9506 - val_loss: 0.0694 - val_acc: 0.9839
Epoch 2/3 359/359 [==============================] - 123s 342ms/step - loss: 0.0809 - acc: 0.9729 - val_loss: 0.1059 - val_acc: 0.9778
Epoch 3/3 359/359 [==============================] - 123s 344ms/step - loss: 0.0717 - acc: 0.9755 - val_loss: 0.1411 - val_acc: 0.9723
These results aren't too bad even with the entire base architecture held constant. This is partly due to the fact that the training images are quite similar to the images that the architecture was trained on. If we were fitting the model on something totally different, say medical image classification for instance, transfer learning would still work but it wouldn't be this easy.
The next step is to fine-tune part of the base model by "unfreezing" some of its layers and allowing them to update their weights during training. I'm not aware of any firm best practices for what to unfreeze; it generally takes a lot of trial and error. For this attempt, I unfroze the last residual block in the network and lowered the learning rate by an order of magnitude.
for layer in model.layers[:142]: layer.trainable = False
for layer in model.layers[142:]: layer.trainable = True
model.compile(optimizer=RMSprop(lr=0.0001), loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit_generator(train_generator,
train_generator.n // batch_size, epochs=3, workers=4,
validation_data=val_generator,
validation_steps=val_generator.n // batch_size)
Epoch 1/3 359/359 [==============================] - 151s 421ms/step - loss: 0.0468 - acc: 0.9826 - val_loss: 1.0175 - val_acc: 0.9098
Epoch 2/3 359/359 [==============================] - 146s 406ms/step - loss: 0.0293 - acc: 0.9903 - val_loss: 0.1305 - val_acc: 0.9829
Epoch 3/3 359/359 [==============================] - 146s 406ms/step - loss: 0.0211 - acc: 0.9938 - val_loss: 0.1197 - val_acc: 0.9849
This technique is very powerful, and it's probably almost always a better idea than starting from scratch if there's an existing model that is at least somewhat similar to the thing you're trying to accomplish. Currently transfer learning is mostly applied to image models, although it's quickly gaining traction for language models as well.
That wraps up this post on convolutional networks. In the next post in this series we'll see how to use a deep learning framework like Keras to build a recommendation system. Don't miss it!