In our last post we did a total overhaul of our model, using a more appropriate neural network type and a more powerful framework. We simplified the problem but only doing a binary classification and only using two classes: our normal and our ceiling effects plots. We were able to get fantastic validation accuracy, but never checked accuracy on a test set, and never considered alternate metrics of evaluating model performance ("accuracy" is not always the most informative metric).

In this post, well create our final model that predicts all four classes, we'll evaluate its accuracy on a set of data held out from any training or validation, and look at a metric other than accuracy to give us more information about our model performance.

We start by loading mostly the same modules we did in the last post. We add `sklearn.metrics`

, for calculating Receiver Operating Characteristics (ROC) and Area Under the Curve (AUC), plus some helpful utilities from `itertools`

and `scipy`

.

In [4]:

```
import warnings
warnings.filterwarnings('ignore')
import os
from datetime import date
import numpy as np
import pandas as pd
from keras.models import Sequential, load_model
from keras.layers import Permute, Reshape, LSTM, Dropout, TimeDistributed, Dense, Activation, Flatten
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import CSVLogger, EarlyStopping, TensorBoard
from sklearn import metrics
from itertools import cycle
from scipy import interp
```

Again we define the same callbacks from the previous post, with the addition of `TensorBoard`

, which allows us to interactively explore many aspects of our model, if desired.

In [4]:

```
# keras callbacks
csv_logger = CSVLogger('epoch-log2.csv', append=True, separator=';')
early_stopper = EarlyStopping(monitor='val_loss',
min_delta=0,
patience=2,
verbose=0, mode='auto')
tensor_board = TensorBoard(log_dir='./tf-log', histogram_freq=0,
write_graph=True, write_images=False)
```

In [ ]:

```
os.chdir(os.path.expanduser('~/share/rkingdc-blog/regplot'))
```

We need to update our model to allow for the prediction of multiple classes, rather than a simple binary classification. This requires only a few changes: output nodes to 4 instead of 1, the activation of the output node to 'softmax', and changing our loss function to "categorical crossentropy".

In [6]:

```
input_dim1 = 256
lstm_size = 150
hidden_layer_size = 100
adam_parms = {'lr': 1e-4, 'beta_1': 0.9, 'beta_2': 0.999}
mod = Sequential()
mod.add(Permute((2,1,3), input_shape=(input_dim1,input_dim1,3)))
mod.add(Reshape(target_shape = (input_dim1,input_dim1*3)))
# our hidden layers
mod.add(LSTM(lstm_size, return_sequences=True))
mod.add(LSTM(lstm_size, return_sequences=True))
# dropout
mod.add(Dropout(0.5))
mod.add(TimeDistributed(Dense(hidden_layer_size), input_shape=(input_dim1, lstm_size) ))
mod.add(Flatten())
mod.add(Dense(4, activation='softmax'))
mod.compile(optimizer=optimizers.Adam(**adam_parms), loss='categorical_crossentropy', metrics=['accuracy'])
mod.summary()
```

All this is very similar as last time--we specify the `class_mode`

of the data generator to be "categorical" so that the vector of responses is correctly constructed, and I make the batch size a little larger, since we now have 4 classes in the data and I want to reduce the likelihood that a given batch won't be missing a class.

In [6]:

```
train_gen = ImageDataGenerator(rescale = 1/255)
test_gen = ImageDataGenerator(rescale = 1/255)
```

In [ ]:

```
train = train_gen.flow_from_directory('data/imgs/train2',
shuffle=True,
batch_size=50,
class_mode='categorical')
val = test_gen.flow_from_directory('data/imgs/test2',
shuffle=True,
batch_size=50,
class_mode='categorical')
```

In [ ]:

```
mod.fit_generator(train,
epochs=15,
verbose=0,
validation_data=val,
callbacks=[csv_logger, early_stopper, tensor_board])
mod.save(f'trained_model_2_{str(date.today())}.h5')
```

With our model trained, we can now use a set of images that were excluded from the training and validation stages to see how this model can perform at classifying data it has never seen before. We know from our high validation accuracy (>99%) that we are definitely able to classify the images in the validation set well, but we need to be sure we haven't overfit our model. To do this, we create another data generator and pass that to the predict method in that model to get an array of class predictions for each image in that set.

In [ ]:

```
holdout_gen = ImageDataGenerator(rescale = 1/255)
holdout = holdout_gen.flow_from_directory('data/holdout_pngs',
shuffle=False,
batch_size=50,
class_mode='categorical')
model_eval = mod.predict_generator(holdout,
use_multiprocessing=True,
workers=3)
```

I like to put the data into a pandas DataFrame to make them a little easier to work with.

In [20]:

```
preds = pd.DataFrame(model_eval, columns = holdout.class_indices.keys())
preds['filename'] = holdout.filenames
preds['truth'] = preds['filename'].apply(os.path.dirname)
preds['predicted_class'] = preds[list(holdout.class_indices.keys())].idxmax(1)
preds.head()
```

Out[20]:

Our accuracy is quite high with this model on the training set:

In [29]:

```
print(str(np.mean(preds['predicted_class'] == preds['truth']) * 100) + "% Accuracy")
```

Another useful metric in classification is the Area Under the Curve (AUC), which takes into account measures of sensitivity and specificity in the model, looking at the predicted probabilities rather than the final classes. The keras model assigns a classification based on the maximal probability, but if we want to reduce type II error, we might want to set our own thresholds rather than use keras's. AUC ranges from 0-1, with 0.5 representing a model that performs completely at chance.

In [21]:

```
def get_truths(df, class_label):
y_truth = df['truth'] == class_label
return y_truth.astype(int).values, df[class_label].values
```

In [22]:

```
# Compute ROC curve and ROC area for each class
n_classes = len(holdout.class_indices)
classes = holdout.class_indices.keys()
lw=2
fpr = dict()
tpr = dict()
roc_auc = dict()
for k,i in holdout.class_indices.items():
t, p = get_truths(preds, k)
fpr[i], tpr[i], _ = metrics.roc_curve(t, p)
roc_auc[i] = metrics.auc(fpr[i], tpr[i])
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
mean_tpr += interp(all_fpr, fpr[i], tpr[i])
# average it and compute AUC
mean_tpr /= n_classes
fpr["overall"] = all_fpr
tpr["overall"] = mean_tpr
roc_auc["overall"] = metrics.auc(fpr["overall"], tpr["overall"])
```

In [139]:

```
print_auc = (lambda x,v: print('{v} AUC: {x:.6f}'.format(v=v, x=x)))
for k,v in holdout.class_indices.items():
print_auc(roc_auc[v], k)
print_auc(roc_auc['overall'], "Overall")
```

We want our AUC to be as close to 1 as possible, so that we are getting values of .999 (at minimum) is very promising. With a few adjustments to cutoff thresholds, we may be able to get a classifier with almost zero risk of letting erroneous plots slip through the cracks.

In the next post, we'll go through how to set those thresholds to reduce type II error, and future posts will transform this model into a useful tool, by making it available as a HTTP request endpoint.