Our last classifier was very poor--it operated at chance--a coin flip would have had the same predictive power. A few things may have been going on that caused us to find no signal. I could be we down sampled our images too much and lost useful information, it could be that our model was poorly configured (it was), it could be we were using the wrong model (we were), or it could be all of these. To address all of these issues, we'll spend a little more time constructing model this time--examining the underlying construct of the data itself, what we really want to learn from it, and how best to model that.

Convolutional neural networks are commonly used in image classification, but in this particular case, the strength of CNNs is something we want to avoid--namely, that the feature location within the actual image is irrelevant. In our case, where a specific feature shows up in our plot images is very important in classifying if it came from good or bad underlying data. For this reason, we'll go with a different type of neural network, a type of recurrent neural network (RNN) called a "Long short-term memory" neural network (LSTM). This network structure allows for a time dimension, and remembers that dimension over the course of training. Although our data don't have a time dimension, we can pretend that one dimension of our image, in this case the one that is representing by the X axis of the plot, is "time", and pass the dimension representing Y and red/blue/green intensity as out other dimension. This will allow our neural network to take into account how various features (say, a cluster of outlying points) relates to the rest of the image.

We'll use keras for training this model, since we need more fine-tuned control over our model creation. This has TensorFlow as a backend, and allows us to run these computations on a GPU, which vastly reduces training time.


We start by importing a few python modules we'll need. warnings and os for logistics, and the keras modules for building the model, loading and processing data, and logging results.

In [1]:
# keeps warnings from printing for nicer blog output
import warnings 

import os

import numpy as np

from keras.models import Sequential, load_model
from keras.layers import Permute, Reshape, LSTM, Dropout, TimeDistributed, Dense, Activation, Flatten
from keras import optimizers

from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import CSVLogger, EarlyStopping
Using TensorFlow backend.

These "callbacks" will be used during training to tell keras to write statistics about training progress to a disk and to halt training if our validation loss starts to decrease (a sign we've had enough training and we're starting to overfit our model).

In [ ]:
# keras callbacks
csv_logger = CSVLogger('tf-log/epoch-log.csv', append=True, separator=';')
early_stopper = EarlyStopping(monitor='val_loss',
                              verbose=0, mode='auto')
In [2]:

Building the Model

We start by designing our model. The input_dim1 is the dimension of our images--256 pixels. Since one axis will be our time axis, we need to know this value to make sure out model knows that we have 256 time steps for each of the 256*3 pixel values.

lstm_size and hidden_layer_size are the number of nodes in our LSTM and fully-connected neural network layers. These values were determined by making informed guesses based on experience.

The adam_parms are parameters for our optimizer. All are the default except the lr (learning rate), which is a slightly smaller step than the default, adjusted based on some initial trainings that had unstable validation loss.

In line 8 and 9, we add layers that serve only to manipulate the tensors going into the network. The Permute swaps dimension 1 and 2 (think Y and X), so that X can be used as the time and Y and Z can be flattened, and Reshape combines the tensor that was 256*256*3 to one that is 256*768 (because these LSTM models expect 2D tensors).

In line 12 and 13 we add two LSTM layers. The return_sequences=True argument ensures the time dimension weights are passed forward to the next layer in addition to the weight from each neuron.

Line 16 we add a Dropout layer, which helps prevent overfitting by randomly setting 50% of input units to zero.

In line 18, we add a "time distributed" fully connected layer, which can be conceptualized as a group of standard fully connected layers, with one for each time step.

In line 20, we flatten the output from that hidden layer into a single vector, and the pass it to the output layer in line 22.

Finally we compile the model so that it is ready to be run.

In [3]:
input_dim1 = 256
lstm_size = 150
hidden_layer_size = 100
adam_parms = {'lr': 1e-4, 'beta_1': 0.9, 'beta_2': 0.999}

mod = Sequential()

mod.add(Permute((2,1,3), input_shape=(input_dim1,input_dim1,3)))
mod.add(Reshape(target_shape = (input_dim1,input_dim1*3)))

# our hidden layers
mod.add(LSTM(lstm_size, return_sequences=True))
mod.add(LSTM(lstm_size, return_sequences=True))

# dropout 

mod.add(TimeDistributed(Dense(hidden_layer_size), input_shape=(input_dim1, lstm_size) ))


mod.add(Dense(1, activation='sigmoid'))

mod.compile(optimizer=optimizers.Adam(**adam_parms), loss='binary_crossentropy', metrics=['accuracy'])
Layer (type)                 Output Shape              Param #   
permute_1 (Permute)          (None, 256, 256, 3)       0         
reshape_1 (Reshape)          (None, 256, 768)          0         
lstm_1 (LSTM)                (None, 256, 150)          551400    
lstm_2 (LSTM)                (None, 256, 150)          180600    
dropout_1 (Dropout)          (None, 256, 150)          0         
time_distributed_1 (TimeDist (None, 256, 100)          15100     
flatten_1 (Flatten)          (None, 25600)             0         
dense_2 (Dense)              (None, 1)                 25601     
Total params: 772,701
Trainable params: 772,701
Non-trainable params: 0

Pre-processing Data

Keras includes robust facilities for preprocessing images and feeding them into models. Here we define generators that will apply a transformation (rescale all pixel values by 1/255). Often more transformations are used that stretch and warp the images, which helps reduce overfitting, but for this specific problem, we know that all images coming in will be a certain size and shape, and warping them may change the perceived underlying construct. Example, a scatter plot of normal data could be warped in such a way that it looks more like data with non-normal residuals. We want to avoid that.

For this model, we'll just see if we can differentiate between plots that were normal and plots that showed ceiling effects. The data/imgs/train and data/imgs/test directories each contain child directories with normal and ceiling effect plots in them. Creating image generators allow the image data to be prepared in tandem with the training of the model, utilizing CPU resources to generate image data and training passing the training off onto the GPU, keeping memory usage much lower than the model we ran with sklearn in the last post.

In [4]:
train_gen = ImageDataGenerator(rescale = 1/255)
test_gen = ImageDataGenerator(rescale = 1/255)
In [5]:
train = train_gen.flow_from_directory('data/imgs/train',
test = test_gen.flow_from_directory('data/imgs/test',
Found 16000 images belonging to 2 classes.
Found 4000 images belonging to 2 classes.


Finally, with our model defined and our data generators created, we can start the training process. Running the code below will run for at most 15 rounds of training (called epochs), or until our early stopping criterion is satisfied.

Then we save out the model so that it can be used for classification later.

Finally, using that model we run a classification on all our testing data to determine an overall accuracy.

In [6]:
       callbacks=[csv_logger, early_stopper])
<keras.callbacks.History at 0x7f031ccea9b0>
In [7]:
from datetime import date'trained_model_1_{str(}.h5')
In [8]:
model_eval = mod.evaluate_generator(test, use_multiprocessing=True, workers=2)
['loss', 'acc']
[0.02431063937046565, 0.9917499966025353]

Our accuracy is at 99%! That's a lot better than that chance accuracy we were seeing before.


This round of training was still preliminary, we still want to include all the different types of erroneous plots in our model and make sure we can accurately classify all of them. So we're not quite ready to start using this model as our final classifier or putting it into a production setting. But, these results are very promising: clearly, by using the correct type of model we are able to differentiate plots with ceiling effects and plots without with a very high degree of accuracy.

In the next post, we'll see the small adjustments needed for the model to classify our four types of plots, and look into tools we can use to evaluate how our model is performing.

Future posts will cover implementing the final trained model into a webservice on our local network, so anyone within our subnet can simply send their plots off to the service and get classifications back, avoiding the need for those hundreds of hours of manual classification.