Digit Recognition

Introduction

Digit recognition is a form of multi-class classification where the inputs are images of handwritten digits and the outputs are the corresponding numbers (0-9). Classifying handwritten text or numbers is important in many real-world scenarios. For example, a postal service can scan the postal codes on envelopes in order to automatically group envelopes that should be sent to the same city. Digit recognition is one of the simplest versions of handwriting classification, and it is a good first project for getting into the field of image classification.

I have implemented a small convolutional neural network (CNN) in TensorFlow that achieves decent performance on the MNIST test set (≈1% classification error). MNIST is a database of grayscale images of handwritten digits, containing a training set of 60000 examples and a test set of 10000 examples. Read more at http://yann.lecun.com/exdb/mnist/.

The source code can be found at https://github.com/CarlFredriksson/digit_recognition.

Implementation

Loading the MNIST Data

You could download the data from http://yann.lecun.com/exdb/mnist/, but a convenient alternative is to import it from tensorflow.keras.datasets.

from tensorflow.keras.datasets import mnist

def load_data():
    (X_train, Y_train), (X_test, Y_test) = mnist.load_data()

    return X_train, Y_train, X_test, Y_test
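
To sanity check the loading, we can print the shapes of the returned arrays (these are the standard MNIST split sizes):

X_train, Y_train, X_test, Y_test = load_data()
print(X_train.shape) # (60000, 28, 28)
print(Y_train.shape) # (60000,)
print(X_test.shape) # (10000, 28, 28)
print(Y_test.shape) # (10000,)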

In order to visualize the data, I plotted the first four training examples.

import matplotlib.pyplot as plt

def visualize_data(X, Y, plot_name):
    # Plot the first four examples in a 2x2 grid with their labels as titles
    for i in range(4):
        plt.subplot(2, 2, i + 1)
        plt.imshow(X[i], cmap=plt.get_cmap("gray"))
        plt.title("y: " + str(Y[i]))
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()
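
Calling the function on the raw training data produces the plot below (the file name is just an example):

visualize_data(X_train, Y_train, "mnist_examples.png")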

The first four training examples with their labels

Preprocessing

A standard preprocessing step is to normalize the inputs, so we divide all pixel values by 255 to put them in the [0, 1] range. In the output layer we are going to use the softmax activation, so we need to convert the outputs from digits (0-9) to one-hot vectors. A one-hot vector is a vector with a single 1 and 0s in all other positions. For example, the one-hot vectors for 0, 3, and 9 are $[1,0,0,0,0,0,0,0,0,0]$, $[0,0,0,1,0,0,0,0,0,0]$, and $[0,0,0,0,0,0,0,0,0,1]$ respectively. Finally, since we are building a CNN, we also want to add a channels dimension to the input. We only need one channel since the images are grayscale.

import numpy as np

def preprocess_data(X_train, Y_train, X_test, Y_test):
    # Normalize image pixel values from 0-255 to 0-1
    X_train = X_train / 255
    X_test = X_test / 255

    # Change y values from 0-9 to one-hot vectors
    Y_train = convert_to_one_hot(Y_train)
    Y_test = convert_to_one_hot(Y_test)

    # Add channels dimension
    X_train = np.expand_dims(X_train, axis=3)
    X_test = np.expand_dims(X_test, axis=3)

    return X_train, Y_train, X_test, Y_test

def convert_to_one_hot(Y):
    # Y.max() + 1 = 10, since the labels contain every digit 0-9
    Y_onehot = np.zeros((len(Y), Y.max() + 1))
    Y_onehot[np.arange(len(Y)), Y] = 1

    return Y_onehot
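
As a quick check of the conversion:

print(convert_to_one_hot(np.array([0, 3, 9])))
# [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]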

To significantly speed up the training process, we are going to divide the training set into mini-batches. Instead of doing one step of gradient descent after processing the whole training set, we do one gradient descent step after each mini-batch. I used a mini-batch size of 200, which with 60000 training examples gives us 300 mini-batches.

import math

def random_mini_batches(X_train, Y_train, mini_batch_size):
    mini_batches = []
    m = X_train.shape[0] # Number of training examples

    # Shuffle training examples
    permutation = list(np.random.permutation(m))
    X_shuffled = X_train[permutation]
    Y_shuffled = Y_train[permutation]

    # Partition into mini-batches
    num_complete_mini_batches = math.floor(m / mini_batch_size)
    for i in range(num_complete_mini_batches):
        X_mini_batch = X_shuffled[i * mini_batch_size : (i + 1) * mini_batch_size]
        Y_mini_batch = Y_shuffled[i * mini_batch_size : (i + 1) * mini_batch_size]
        mini_batch = (X_mini_batch, Y_mini_batch)
        mini_batches.append(mini_batch)

    # Handle the case where the last mini-batch is smaller than mini_batch_size
    if m % mini_batch_size != 0:
        X_mini_batch = X_shuffled[num_complete_mini_batches * mini_batch_size : m]
        Y_mini_batch = Y_shuffled[num_complete_mini_batches * mini_batch_size : m]
        mini_batch = (X_mini_batch, Y_mini_batch)
        mini_batches.append(mini_batch)

    return mini_batches
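
As a sanity check of the partitioning (the shapes assume the preprocessed training set from above):

mini_batches = random_mini_batches(X_train, Y_train, 200)
print(len(mini_batches)) # 300
print(mini_batches[0][0].shape) # (200, 28, 28, 1)
print(mini_batches[0][1].shape) # (200, 10)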

Creating the Model

Now it is time to create the CNN model. To keep the training time short, I chose a small model with a single convolutional layer. This simple model yields pretty decent results (≈1% classification error), but deeper models can perform even better. A list of the best-performing models can be found at http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html.

The model can be summarized as: convolutional -> max_pool -> dropout -> flatten -> dense -> dense -> softmax. The dropout layer is used to reduce overfitting, and the dense layers are fully connected layers. Note the placeholder training_flag, which decides when dropout should be applied. We only want dropout applied during training, so the placeholder defaults to False.

import tensorflow as tf

def create_model(height, width, channels, num_classes):
    tf.reset_default_graph()

    X = tf.placeholder(dtype=tf.float32, shape=(None, height, width, channels), name="X")
    Y = tf.placeholder(dtype=tf.float32, shape=(None, num_classes), name="Y")
    training_flag = tf.placeholder_with_default(False, shape=())

    conv1 = tf.layers.conv2d(X, filters=32, kernel_size=5, strides=1, padding="same", activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2, padding="valid")
    # Dropout is only applied when training_flag is True, i.e. during training
    dropout = tf.layers.dropout(pool1, rate=0.2, training=training_flag)
    flatten = tf.layers.flatten(dropout)
    dense1 = tf.layers.dense(flatten, 128, activation=tf.nn.relu)
    Y_hat = tf.layers.dense(dense1, num_classes, activation=tf.nn.softmax, name="Y_hat")

    # Compute cost
    J = compute_cost(Y, Y_hat)

    return X, Y, training_flag, Y_hat, J

To compute the cost I used the standard cross-entropy cost function for softmax outputs. Let $n_c$ be the number of classes ($n_c=10$ in our case). For a single training example with correct one-hot vector $y$ and softmax output $\hat{y}$, the cost is:

$$ J(y,\hat{y}) = -\sum_{i=1}^{n_c} y_i \log{\hat{y}_i} $$
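
Since $y$ is one-hot, only the term for the correct class contributes to the sum. For example, if the correct digit is 3 and the model assigns it probability $\hat{y}_3 = 0.9$, the cost for that example is $-\log{0.9} \approx 0.105$; a confident correct prediction gives a cost close to 0, while a confident wrong one gives a large cost.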

To compute the full cost, we average over all training examples in the current mini-batch. Note that a very small value is added to the inputs of the logarithms to avoid taking the log of 0.

def compute_cost(Y, Y_hat):
    # Add small value epsilon to tf.log() calls to avoid taking the log of 0
    epsilon = 1e-10
    J = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(Y_hat + epsilon), axis=1), name="J")

    return J
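
As a side note, the epsilon trick can be avoided entirely by letting TensorFlow compute the softmax and the cross-entropy together from raw logits, which is more numerically stable. A minimal sketch, assuming the final dense layer is changed to output logits (activation=None) instead of softmax probabilities:

def compute_cost_from_logits(Y, logits):
    # softmax_cross_entropy_with_logits_v2 applies the softmax internally,
    # so the logits must not already be softmax outputs
    J = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y, logits=logits), name="J")

    return J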

Training the Model

To train the model I used the Adam optimizer with a learning rate of 0.001 and trained for 10 epochs. After training, the graph and variables are saved as a TensorFlow SavedModel. Classification accuracy on the training and test sets is computed by dividing the number of correctly classified examples by the size of the data set. The model achieves a test set accuracy of about 99%, or in other words 1% error. Error is often used instead of accuracy when comparing models with very high accuracy.

def run_model(X, Y, training_flag, Y_hat, J, X_train, Y_train, X_test, Y_test, mini_batches, LEARNING_RATE, NUM_EPOCHS):
    # Create train op
    optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
    train_op = optimizer.minimize(J)

    # Start session
    with tf.Session() as sess:
        # Initialize variables
        sess.run(tf.global_variables_initializer())

        # Training loop
        for epoch in range(NUM_EPOCHS):
            for (X_mini_batch, Y_mini_batch) in mini_batches:
                _, J_train = sess.run([train_op, J], feed_dict={X: X_mini_batch, Y: Y_mini_batch, training_flag: True})
            print("epoch: " + str(epoch) + ", J_train: " + str(J_train))

        # Final costs
        J_train = sess.run(J, feed_dict={X: X_train, Y: Y_train})
        J_test = sess.run(J, feed_dict={X: X_test, Y: Y_test})

        # Compute training accuracy
        Y_pred = sess.run(Y_hat, feed_dict={X: X_train, Y: Y_train})
        accuracy_train = compute_accuracy(Y_pred, Y_train)

        # Compute test accuracy
        Y_pred = sess.run(Y_hat, feed_dict={X: X_test, Y: Y_test})
        accuracy_test = compute_accuracy(Y_pred, Y_test)

        # Save model
        tf.saved_model.simple_save(sess, "saved_model", inputs={"X": X, "Y": Y}, outputs={"Y_hat": Y_hat})

    return J_train, J_test, accuracy_train, accuracy_test

def compute_accuracy(Y_pred, Y_real):
    Y_pred = np.argmax(Y_pred, axis=1)
    Y_real = np.argmax(Y_real, axis=1)
    num_correct = np.sum(Y_pred == Y_real)
    accuracy = num_correct / Y_real.shape[0]

    return accuracy
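
Putting the pieces together, a training script might look something like this (a sketch; the hyperparameters are the ones mentioned above, and the plot file name is just an example):

LEARNING_RATE = 0.001
NUM_EPOCHS = 10
MINI_BATCH_SIZE = 200

X_train, Y_train, X_test, Y_test = load_data()
visualize_data(X_train, Y_train, "mnist_examples.png")
X_train, Y_train, X_test, Y_test = preprocess_data(X_train, Y_train, X_test, Y_test)
mini_batches = random_mini_batches(X_train, Y_train, MINI_BATCH_SIZE)
X, Y, training_flag, Y_hat, J = create_model(28, 28, 1, 10)
J_train, J_test, accuracy_train, accuracy_test = run_model(
    X, Y, training_flag, Y_hat, J, X_train, Y_train, X_test, Y_test,
    mini_batches, LEARNING_RATE, NUM_EPOCHS)
print("accuracy_train: " + str(accuracy_train) + ", accuracy_test: " + str(accuracy_test))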

Testing on My Own Handwritten Image

I made a simple black-and-white image in Paint.

Handwritten image

The image needs to be loaded and preprocessed. The preprocessing involves inverting the colors, since all of the images in MNIST have a black background. The image also needs to be resized to (28, 28), which is the size of the MNIST images. As before, we divide by 255 to normalize the input values and add a channels dimension. This time we also add a dimension at the start, since the model expects batches of input images.

import cv2

def load_img(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    return img

def preprocess_img(img, size, invert_colors=False):
    if invert_colors:
        img = cv2.bitwise_not(img)
    img = cv2.resize(img, dsize=size, interpolation=cv2.INTER_CUBIC)
    img = img / 255
    img = np.expand_dims(img, axis=0)
    img = np.expand_dims(img, axis=3)

    return img
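
For example, loading and preprocessing the image might look like this (the file name is a placeholder for wherever the image is stored):

img = preprocess_img(load_img("my_digit.png"), size=(28, 28), invert_colors=True)
print(img.shape) # (1, 28, 28, 1)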

To test the trained model on my own image, I loaded the graph and variables from the SavedModel in another session. To do a prediction we need to evaluate the output Y_hat/Softmax:0, with our handwritten image as the input X:0. The value of Y:0 is irrelevant at this point; we just need to provide something that fits the expected dimensions. The suffix :0 is added to tensor names by TensorFlow and refers to the tensor's index among the outputs of the operation that produces it.

# Start session
with tf.Session() as sess:
    # Load model
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], "saved_model")

    # Predict digit for test image
    y_pred = sess.run("Y_hat/Softmax:0", feed_dict={"X:0": img, "Y:0": Y_train})[0]
    y_pred = np.argmax(y_pred)
    print("Predicted digit: " + str(y_pred))

As expected, the output is Predicted digit: 2.

Conclusion

I am satisfied with how this project went, and it impressed me that such a simple model can get such decent results. In fact, a simple feedforward neural network (FNN) can also achieve pretty respectable results on MNIST. The CNN model I showed is slightly more complex than the small FNN I tried, but I chose the CNN since its classification error was lower (the FNN got about 2%) and the training time was still very fast.

The source code can be found at https://github.com/CarlFredriksson/digit_recognition.

Thank you for reading, and feel free to send me any questions.