Reducing Overfitting

In the last section we trained a model that performed really well on the training set but did much worse on the testing set. The reason for that was overfitting: the model learned to fit the training set so closely that it did not generalize to new data. Common techniques for reducing overfitting are as follows:

  1. More samples
  2. Add dropout
  3. Image Augmentation
  4. Add regularization

We cannot add more samples since we are already using the entire dataset. In practice, we could collect more pictures of the 10 categories and resize them to 32x32, but we are not going to do that here. More samples give the classifier a wider range of examples to learn from, which makes it less likely to overfit. We will add dropout, which randomly discards activations between layers, to help the model generalize better. We said that adding more samples makes the model more general; what if we could also shift the existing images a bit so the model cannot fixate on particular features of each image? This should help the model generalize better, and it is called image augmentation. The last thing we can do is add regularization, which penalizes model complexity and therefore also helps generalization.

Let's load in the dataset:

In [1]:
from keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
Using TensorFlow backend.

Now let's look at the number of samples in the training and testing sets:

In [143]:
# From cifar10
x_train.shape, y_train.shape, x_test.shape, y_test.shape
Out[143]:
((50000, 32, 32, 3), (50000, 1), (10000, 32, 32, 3), (10000, 1))

In data science it is important to separate your dataset not into two sets like we have been doing (training and testing), but into three: training, validation, and testing. What is a validation set? It is a portion of the training data held out from training so we can monitor how well the model generalizes and tune hyperparameters without touching the test set. There is no set rule for how to divide the training, validation, and testing sets. 60% training, 20% validation, and 20% testing is usually in the ballpark, but feel free to adjust the split depending on your model.

We are going to bring in a tool from the popular scikit-learn library to randomly split 20% of the training data into a validation set.

In [2]:
# Hold out 20% of the training data as a validation set (x_val, y_val)
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=0)
In [3]:
# from train_test_split
x_train.shape, x_val.shape, x_test.shape, y_val.shape
Out[3]:
((40000, 32, 32, 3), (10000, 32, 32, 3), (10000, 32, 32, 3), (10000, 1))

We did not do this before, but it is always good to have a peek at your data before you work with it. Many data science practitioners say not to look too hard, though, because you can start to form biases about which features you think will be important in a model. The script below lets us peek at some of the data:

In [5]:
from matplotlib import pyplot as plt

# Plot the first nine training images in a 3x3 grid
for i in range(9):
    plt.subplot(330 + 1 + i)
    plt.imshow(x_train[i])
plt.show()

We will use the function below to apply the VGG16 preprocessing to each image so that the image data generator works with the model.

In [6]:
import numpy as np
from keras.applications.vgg16 import preprocess_input

def preprocess_input_vgg(x):
    # preprocess_input expects a batch, so add a batch dimension, preprocess, then strip it again
    X = np.expand_dims(x, axis=0)
    X = preprocess_input(X)
    return X[0]

Import the modules we need, set the number of classes, and make the labels categorical as before. Notice that this time we are importing ImageDataGenerator; we will use it for data augmentation.

In [7]:
from keras.layers.core import Dropout
from keras.layers.normalization import BatchNormalization
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Flatten, Dense
from keras.models import Model
from keras.optimizers import Adam
import numpy as np
from keras.utils import np_utils

num_classes = 10

y_train = np_utils.to_categorical(y_train, num_classes)
y_val = np_utils.to_categorical(y_val, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)

Now we want to augment our samples a bit in order to get the classifier to generalize better.

In [ ]:
datagen = ImageDataGenerator(
    rotation_range=25,
    width_shift_range=0.25,
    height_shift_range=0.25, 
    horizontal_flip=True,
    preprocessing_function=preprocess_input_vgg)

evalgen = ImageDataGenerator(
    preprocessing_function=preprocess_input_vgg)
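
If you want to sanity-check the augmentation settings, a quick sketch like the one below plots a few augmented training images. It uses a separate generator without the VGG preprocessing (which subtracts channel means and would make the images hard to view); the name viz_gen is just for illustration.

In [ ]:
from matplotlib import pyplot as plt

viz_gen = ImageDataGenerator(
    rotation_range=25,
    width_shift_range=0.25,
    height_shift_range=0.25,
    horizontal_flip=True)

# Pull one augmented batch of the first nine training images and plot it
augmented, _ = next(viz_gen.flow(x_train[:9], y_train[:9], batch_size=9, shuffle=False))
for i in range(9):
    plt.subplot(330 + 1 + i)
    plt.imshow(augmented[i].astype('uint8'))
plt.show()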

We are going to add dropout between the fully-connected layers in our model. With a rate of 0.5, dropout randomly zeroes half of the activations passed to the next layer at each training step. Although this seems counter-intuitive at first, it helps the model generalize. You want to add as little dropout as you can get away with while keeping the training, validation, and testing accuracies close to each other.
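
To see concretely what a dropout rate of 0.5 does, here is a tiny NumPy sketch (not part of the model): during training roughly half of the activations are zeroed at random and the survivors are scaled up so the expected total stays the same, which is how Keras implements dropout ("inverted dropout").

In [ ]:
import numpy as np

rng = np.random.RandomState(0)
activations = rng.rand(8)                 # pretend these came out of a dense layer
keep_prob = 0.5
mask = rng.rand(8) < keep_prob            # keep each unit with probability 0.5
dropped = activations * mask / keep_prob  # zero out the rest, rescale the survivors
print(activations)
print(dropped)

Now let's build the model itself, with dropout after each fully-connected layer.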

In [ ]:
#Get back the convolutional part of a VGG network trained on ImageNet
#Notice this time we are not freezing the convolutional layers, so their weights will also be fine-tuned
model_vgg16_conv = VGG16(weights='imagenet', include_top=False)

#Create your own input format (here 32x32x3)
img_input = Input(shape=(32, 32, 3), name='image_input')

#Use the generated model
output_vgg16_conv = model_vgg16_conv(img_input)

#Add the fully-connected layers
x = Flatten(name='flatten')(output_vgg16_conv)
x = Dense(4096, activation='relu', name='fc1')(x)
x = BatchNormalization()(x)
x = Dropout(.5)(x)
x = Dense(4096, activation='relu', name='fc2')(x)
x = BatchNormalization()(x)
x = Dropout(.5)(x)
x = Dense(10, activation='softmax', name='predictions')(x)

This time when we create our model we will pass validation data to fit_generator. The model will not train on this data, but it will report accuracy on it after each epoch. If the validation accuracy is much lower than the training accuracy, that is a sign of overfitting.

In [148]:
#Create your own model
my_model = Model(inputs=img_input, outputs=x)

epochs = 3

adam = Adam(lr=0.0001)
my_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])

batch_size = 256
history = my_model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                                 validation_data=datagen.flow(x_val, y_val, batch_size=batch_size),
                                 validation_steps=len(x_val) / 20,
                                 steps_per_epoch=len(x_train) / 20,
                                 epochs=epochs)
my_model.save_weights('my_model_weights.h5')
Epoch 1/3
2000/2000 [==============================] - 211s - loss: 0.8147 - acc: 0.7333 - val_loss: 0.5505 - val_acc: 0.8092
Epoch 2/3
2000/2000 [==============================] - 210s - loss: 0.3957 - acc: 0.8634 - val_loss: 0.5124 - val_acc: 0.8314
Epoch 3/3
2000/2000 [==============================] - 210s - loss: 0.2616 - acc: 0.9097 - val_loss: 0.5037 - val_acc: 0.8404
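
The training log above already shows the gap between training and validation accuracy. A quick sketch like the one below plots the two curves from the history object returned by fit_generator; the keys 'acc' and 'val_acc' match the metric names in the log.

In [ ]:
from matplotlib import pyplot as plt

plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()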

We only trained for three epochs, and we can already tell the model is overfitting much less than before. Still, the goal is to have the training and validation accuracy as close as possible, and the remaining gap shows that the model is still overfitting somewhat and can be optimized further. Hopefully you now understand the importance of hyperparameter tuning when designing architectures. Let's evaluate the model on our testing set:

In [149]:
# Evaluate on the held-out test set, using the same VGG preprocessing but no augmentation
my_model.load_weights('my_model_weights.h5')
evalgen = ImageDataGenerator(preprocessing_function=preprocess_input_vgg)
score = my_model.evaluate_generator(evalgen.flow(x_test, y_test, batch_size=256),
                                    steps=int(np.ceil(len(x_test) / 256)))
my_model.metrics_names, score
Out[149]:
(['loss', 'acc'], [0.43555837689055205, 0.86739999999999995])

It looks like the model is doing much better, since the scores on the training, validation, and testing sets are closer together. However, there is always more that can be done. We have not added l2 regularization, and we are still using the VGG16 architecture; more modern architectures such as ResNets can outperform VGG16. Perhaps we can give those a try!
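
As a pointer for that next step, here is a minimal sketch of how l2 weight regularization could be added to the fully-connected layers of the model above; the 1e-4 strength is just an illustrative value, not a tuned one.

In [ ]:
from keras.regularizers import l2
from keras.layers import Dense, Flatten

# Same classification head as before, but each dense layer now carries an l2 weight penalty
x = Flatten(name='flatten')(output_vgg16_conv)
x = Dense(4096, activation='relu', name='fc1', kernel_regularizer=l2(1e-4))(x)
x = BatchNormalization()(x)
x = Dropout(.5)(x)
x = Dense(4096, activation='relu', name='fc2', kernel_regularizer=l2(1e-4))(x)
x = BatchNormalization()(x)
x = Dropout(.5)(x)
x = Dense(10, activation='softmax', name='predictions')(x)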

