Review

In our last post we did a total overhaul of our model, using a more appropriate neural network type and a more powerful framework. We simplified the problem by doing only a binary classification with two classes: our normal and our ceiling effects plots. We got fantastic validation accuracy, but we never checked accuracy on a held-out test set, and we never considered metrics other than accuracy for evaluating model performance ("accuracy" is not always the most informative metric).

In this post we'll create our final model, which predicts all four classes; evaluate its accuracy on a set of data held out from both training and validation; and look at a metric other than accuracy to give us more information about model performance.

We start by loading mostly the same modules as in the last post. We add sklearn.metrics, for calculating Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC), plus some helpful utilities from itertools and scipy.

In [4]:
import warnings 
warnings.filterwarnings('ignore') 

import os
from datetime import date

import numpy as np
import pandas as pd

from keras.models import Sequential, load_model
from keras.layers import Permute, Reshape, LSTM, Dropout, TimeDistributed, Dense, Activation, Flatten
from keras import optimizers

from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import CSVLogger, EarlyStopping, TensorBoard

from sklearn import metrics
from itertools import cycle
from scipy import interp
Using TensorFlow backend.

Again we define the same callbacks as in the previous post, with the addition of TensorBoard, which lets us interactively explore many aspects of our model, if desired.

In [4]:
# keras callbacks
csv_logger = CSVLogger('epoch-log2.csv', append=True, separator=';')
early_stopper = EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=2,
                              verbose=0, mode='auto')
tensor_board = TensorBoard(log_dir='./tf-log', histogram_freq=0,
                          write_graph=True, write_images=False)
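
If you want to explore those logs interactively, the TensorBoard dashboard can be started from a shell with tensorboard --logdir ./tf-log (pointing at the log_dir we passed to the callback above) and viewed in a browser.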
In [ ]:
os.chdir(os.path.expanduser('~/share/rkingdc-blog/regplot'))

Building the Model

We need to update our model to allow for the prediction of multiple classes, rather than a simple binary classification. This requires only a few changes: increasing the number of output nodes from 1 to 4, changing the activation of the output layer to 'softmax', and switching our loss function to 'categorical_crossentropy'.

In [6]:
input_dim1 = 256
lstm_size = 150
hidden_layer_size = 100
adam_parms = {'lr': 1e-4, 'beta_1': 0.9, 'beta_2': 0.999}

mod = Sequential()

mod.add(Permute((2,1,3), input_shape=(input_dim1,input_dim1,3)))
mod.add(Reshape(target_shape = (input_dim1,input_dim1*3)))

# our hidden layers
mod.add(LSTM(lstm_size, return_sequences=True))
mod.add(LSTM(lstm_size, return_sequences=True))

# dropout 
mod.add(Dropout(0.5))

mod.add(TimeDistributed(Dense(hidden_layer_size), input_shape=(input_dim1, lstm_size) ))

mod.add(Flatten())

mod.add(Dense(4, activation='softmax'))

mod.compile(optimizer=optimizers.Adam(**adam_parms), loss='categorical_crossentropy', metrics=['accuracy'])
mod.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
permute_1 (Permute)          (None, 256, 256, 3)       0         
_________________________________________________________________
reshape_1 (Reshape)          (None, 256, 768)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 256, 150)          551400    
_________________________________________________________________
lstm_2 (LSTM)                (None, 256, 150)          180600    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256, 150)          0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 256, 100)          15100     
_________________________________________________________________
flatten_1 (Flatten)          (None, 25600)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 102404    
=================================================================
Total params: 849,504
Trainable params: 849,504
Non-trainable params: 0
_________________________________________________________________

Training the Model

All this is very similar to last time--we specify the class_mode of the data generators as "categorical" so that the labels are one-hot encoded the way categorical crossentropy expects, and I make the batch size a little larger, since we now have 4 classes in the data and I want to reduce the likelihood that a given batch is missing a class.

In [6]:
train_gen = ImageDataGenerator(rescale = 1/255)
test_gen = ImageDataGenerator(rescale = 1/255)
In [ ]:
train = train_gen.flow_from_directory('data/imgs/train2',
                                      shuffle=True,
                                      batch_size=50,
                                      class_mode='categorical')
val = test_gen.flow_from_directory('data/imgs/test2',
                                    shuffle=True,
                                    batch_size=50,
                                    class_mode='categorical')
In [ ]:
mod.fit_generator(train,
       epochs=15,
       verbose=0,
       validation_data=val,
       callbacks=[csv_logger, early_stopper, tensor_board])
mod.save(f'trained_model_2_{str(date.today())}.h5')
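
Since training takes a while, it's nice that we don't have to repeat it: the load_model function we imported earlier can read the saved model back in. A minimal sketch, assuming the file written by the save call above (substitute the actual file name if you reload on a different day):

In [ ]:
# reload the trained model from disk; the path mirrors the pattern used
# in mod.save() above, so it only resolves as-is on the same day
model_path = f'trained_model_2_{str(date.today())}.h5'
mod = load_model(model_path)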

Testing the model

With our model trained, we can now use a set of images that were excluded from both the training and validation stages to see how the model performs on data it has never seen before. We know from our high validation accuracy (>99%) that we can classify the images in the validation set well, but we need to be sure we haven't overfit our model. To do this, we create another data generator and pass it to the model's predict_generator method to get an array of predicted class probabilities for each image in the holdout set.

In [ ]:
holdout_gen = ImageDataGenerator(rescale = 1/255)
holdout = holdout_gen.flow_from_directory('data/holdout_pngs',
                                    shuffle=False,
                                    batch_size=50,
                                    class_mode='categorical')

model_eval = mod.predict_generator(holdout, 
                                    use_multiprocessing=True, 
                                    workers=3)

I like to put the data into a pandas DataFrame to make them a little easier to work with.

In [20]:
preds = pd.DataFrame(model_eval, columns = holdout.class_indices.keys())
preds['filename'] = holdout.filenames
preds['truth'] = preds['filename'].apply(os.path.dirname)
preds['predicted_class'] = preds[list(holdout.class_indices.keys())].idxmax(1)
preds.head()
Out[20]:
biased ceiling none outlier filename truth predicted_class
0 1.0 2.481814e-10 8.684403e-13 6.872439e-11 biased/biased_0000001.png biased biased
1 1.0 4.160816e-10 4.601760e-13 2.775461e-11 biased/biased_0000002.png biased biased
2 1.0 3.122757e-10 5.066882e-13 5.715443e-11 biased/biased_0000003.png biased biased
3 1.0 1.847489e-10 1.812751e-13 6.597836e-11 biased/biased_0000004.png biased biased
4 1.0 3.912477e-10 1.104792e-12 1.224962e-10 biased/biased_0000005.png biased biased

Our accuracy with this model on the holdout set is quite high:

In [29]:
print(str(np.mean(preds['predicted_class'] == preds['truth']) * 100) + "% Accuracy")
99.5% Accuracy
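
Overall accuracy doesn't tell us where the few misclassifications land, though. A quick cross-tabulation of the true class against the predicted class (a simple confusion matrix, sketched here with pandas) makes that easy to inspect:

In [ ]:
# rows are the true classes, columns the predicted classes;
# off-diagonal counts show which classes get confused with which
pd.crosstab(preds['truth'], preds['predicted_class'])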

Another useful metric in classification is the Area Under the Curve (AUC), which takes into account the sensitivity and specificity of the model by looking at the predicted probabilities rather than the final class assignments. So far we have assigned each image to the class with the highest predicted probability, but if we want to reduce type II error, we might want to set our own probability thresholds instead. AUC ranges from 0 to 1, with 0.5 representing a model that performs completely at chance and 1 representing perfect separation.

In [21]:
def get_truths(df, class_label):
    # return a 0/1 vector indicating whether each row truly belongs to
    # class_label, along with the predicted probabilities for that class
    y_truth = df['truth'] == class_label
    return y_truth.astype(int).values, df[class_label].values
In [22]:
# Compute ROC curve and ROC area for each class
n_classes = len(holdout.class_indices)
classes = holdout.class_indices.keys()
lw=2
fpr = dict()
tpr = dict()
roc_auc = dict()
for k,i in holdout.class_indices.items():
    t, p = get_truths(preds, k)
    fpr[i], tpr[i], _ = metrics.roc_curve(t, p)
    roc_auc[i] = metrics.auc(fpr[i], tpr[i])
    
# aggregate all false positive rates across classes
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# average it and compute AUC
mean_tpr /= n_classes

fpr["overall"] = all_fpr
tpr["overall"] = mean_tpr
roc_auc["overall"] = metrics.auc(fpr["overall"], tpr["overall"])
In [139]:
print_auc = (lambda x,v: print('{v} AUC: {x:.6f}'.format(v=v, x=x)))
for k,v in holdout.class_indices.items():
    print_auc(roc_auc[v], k)
print_auc(roc_auc['overall'], "Overall")
biased AUC: 1.000000
ceiling AUC: 0.999891
none AUC: 0.999808
outlier AUC: 0.999961
Overall AUC: 0.999931
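
The false and true positive rates we just computed can also be drawn as ROC curves. A minimal plotting sketch with matplotlib (this is where the cycle utility imported earlier comes in handy, for rotating line colors):

In [ ]:
import matplotlib.pyplot as plt

colors = cycle(['C0', 'C1', 'C2', 'C3'])
plt.figure()
for (label, i), color in zip(holdout.class_indices.items(), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='{0} (AUC = {1:.4f})'.format(label, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()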

We want our AUC to be as close to 1 as possible, so the fact that we are getting values of .999 (at minimum) is very promising. With a few adjustments to the cutoff thresholds, we may be able to get a classifier with almost zero risk of letting erroneous plots slip through the cracks.
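
As a rough sketch of what adjusting a cutoff might look like (the threshold below is purely illustrative, not a tuned number), we could require very high confidence before accepting the "none" label--that is, before declaring a plot problem-free--and flag everything else for human review:

In [ ]:
# purely illustrative cutoff: only treat a plot as clean when the model is
# extremely confident; everything else gets flagged for a human to look at
threshold = 0.999
preds['flag_for_review'] = preds['none'] < threshold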

In the next post, we'll go through how to set those thresholds to reduce type II error, and future posts will transform this model into a useful tool by making it available as an HTTP endpoint.