Image Recognition with a Convolutional Neural Network in TensorFlow

Nov 18, 2021 · 11 min read

Hello everyone. Today we are going to look at a machine learning model for image recognition, more specifically a convolutional neural network (CNN). This is not my usual material, since I normally cover web development topics, but this is my attempt at writing machine learning articles, so feel free to let me know if I am wrong about a particular area.

Let’s go through the usual procedure for working on a real-life machine learning problem.

Framing the problem

What problem do we want to solve? In our case, we are tasked with building a machine learning model that predicts the correct label for an input image.

What is Image Recognition?

In general terms, image recognition is no different from other machine learning problems: we feed in data in the hope that the model learns a pattern and is ultimately able to make predictions on new data.

There are a few different ways to tackle image recognition; in this article we will look at the CNN (convolutional neural network). Before that, we will go through the usual machine learning procedure.

Data Preparation

A golden rule of thumb is to always assume the data you have is dirty. Inspect and analyze the dataset before building a model. Real-world data will almost always have issues such as missing values or imbalanced classes, and analyzing the dataset first means you know what you are dealing with.

We want to focus on convolutional neural networks in TensorFlow for now, so we will use the popular image dataset Fashion MNIST[1]. The data is already labelled and split into training and test sets, and the classes are well balanced, so we can jump straight into model building and training.

In our notebook, we plotted the dataset with t-SNE. The visualization shows that some of the labels overlap and are not linearly separable, so a simple linear model is unlikely to be enough, which makes a CNN a natural first choice. t-SNE is expensive to compute and may take a long time to run; an alternative is to use UMAP to visualize the dataset. I have included an example on Kaggle[2] in case you are interested.

t-SNE visualization of the Fashion MNIST dataset

Check out the final notebook for some analysis of the dataset.

Model Building

What do you need to start?

The easiest way to start is Google Colab[3]. It has most of the libraries you need preinstalled, and you can even use its GPU/TPU runtimes if you do not have your own GPU. For smaller projects, I definitely recommend Google Colab.

If you want to run the training on your local machine, a convenient option is to install Anaconda[4].

Another option is to run your notebook in the cloud (AI Platform on GCP, or an EC2 DLAMI on AWS)[5][6], where you can configure a virtual machine with the specs you need. Take note that you may be charged based on your usage, so use it wisely.

Let’s Start

Why use CNN on Images

The main characteristics that make a convolutional neural network the right choice for image classification are feature localization and independence of feature location. Each filter looks at a small neighborhood of pixels, so it takes local context into account, and the same filter is applied across the whole image, so a feature can be detected wherever it appears. This is extremely important for images. Sharing filter weights also means a convolutional layer has far fewer parameters to learn than a fully connected layer on the same input, which reduces the chance of overfitting.

Now that we have gone over the advantages of the CNN model, let us build one, and I will explain along the way.
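As a rough sketch, a model like this can be written with the Keras Sequential API as follows. The layer sizes are an assumption, picked so that the total matches the 710,474 parameters reported in the model summary further down; treat it as illustrative rather than the exact notebook code.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed architecture: the sizes below are illustrative, chosen so the
# parameter total agrees with the 710,474 quoted in the model summary.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),              # Fashion MNIST: 28x28 grayscale
    layers.Conv2D(32, (3, 3), activation="relu"),   # 32 learnable 3x3 filters
    layers.MaxPooling2D((2, 2)),                    # 26x26 feature maps -> 13x13
    layers.Flatten(),                               # 13 * 13 * 32 = 5408 values
    layers.Dense(128, activation="relu"),           # first fully connected layer
    layers.Dense(128, activation="relu"),           # second fully connected layer
    layers.Dense(10, activation="softmax"),         # one probability per class
])

print(model.count_params())
```

The list passed to Sequential is read top to bottom, which is the order in which an image flows through the network.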

Pretty easy, right? Another way to achieve a similar result is to use the add function:

Pretty straightforward so far. Both code snippets build the same CNN model, just with different methods. Personally, I prefer the add function. Think of it as stacking layer upon layer to build your model.
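A minimal sketch of the add() style might look like this. As before, the layer sizes are an assumption (chosen to be consistent with the 710,474 parameters mentioned in the model summary), not a copy of the notebook.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Same assumed architecture, built one layer at a time with add().
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(28, 28, 1)))
model.add(layers.Conv2D(32, (3, 3), activation="relu"))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation="relu"))
model.add(layers.Dense(128, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))

print(model.count_params())
```

One practical advantage of add() is that layers can be appended conditionally, for example inside a loop or an if statement.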

Sequential vs Functional

There are two different Keras APIs for building a TensorFlow model: the sequential API and the functional API. For our problem today, the sequential API is enough. Consider the functional API when you want more control over how the layers in the network are connected, for example multiple inputs, multiple outputs, or layers that branch and merge, such as residual connections.

Conv2d Layer

Next, we stacked a convolutional layer (line 2 of the snippet). In simple terms, a convolutional layer learns filters, and many of them end up acting as edge detectors on the input image. Image convolution applies a filter that adds each pixel value of an image to its neighbors, weighted according to a kernel matrix. Doing so alters the image and can help the neural network process it. With an edge-detection kernel, the main idea is that when a pixel is similar to all its neighbors, the weighted values cancel each other out (resulting in a value of 0). Therefore, the more similar the pixels, the darker that part of the output image, and the more different they are, the lighter it is.

Let’s try applying a kernel to an image. On the left is the image before the kernel is applied, and on the right is the result, with pronounced edges.
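To make the cancelling effect concrete, here is a small pure-Python sketch. The 3x3 kernel and the toy image are hypothetical values chosen for illustration; a real Conv2D layer learns its kernel values during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the weighted pixels."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Edge-detection kernel: the weights sum to zero, so a uniform region gives 0.
edge_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

# Toy 5x6 image: dark left half (0) and bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

edges = convolve2d(image, edge_kernel)
for row in edges:
    print(row)  # each row is [0, -27, 27, 0]
```

The flat regions on either side produce 0, while the two columns around the dark-to-bright boundary produce large values, which is exactly the "pronounced edge" effect.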

MaxPooling/AveragePooling Layer

Processing image data can be computationally expensive, hence a pooling layer (here max pooling) is added. A pooling layer reduces the spatial size of its input by sliding a window over it and keeping one value per region: the maximum for max pooling, or the mean for average pooling. This compresses the feature map while keeping its dominant features.
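As an illustration, a 2x2 max pooling step can be sketched in plain Python. The feature map values here are made up for the example.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))  # keep only the strongest response
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 2], [2, 9]]
```

A 4x4 input becomes 2x2, so the next layer has a quarter of the values to process.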

Fully connected layer/Hidden layers

In our case we added 2 fully connected layers. Giving each layer an activation function lets the neurons learn non-linear combinations of their inputs. Without it, the stack of layers collapses into a single linear function, that is, a shallow network. For image classification problems we normally use relu or tanh. Just remember that each activation function has its advantages and weaknesses, so it is best to try out several during the hyperparameter tuning phase. [7]
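The collapse into a linear function can be checked with a tiny example. The weights below are hypothetical and biases are left out to keep the arithmetic short.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def relu(m):
    """Apply ReLU element-wise."""
    return [[max(0, v) for v in row] for row in m]

x = [[1, -2]]            # one sample with two features
w1 = [[1, -1], [2, 0]]   # first layer weights (hypothetical)
w2 = [[1], [3]]          # second layer weights (hypothetical)

# Without an activation, two layers equal one layer with weights w1 @ w2.
stacked = matmul(matmul(x, w1), w2)
collapsed = matmul(x, matmul(w1, w2))

# With ReLU in between, the result is no longer a single linear map.
nonlinear = matmul(relu(matmul(x, w1)), w2)

print(stacked, collapsed, nonlinear)  # [[-6]] [[-6]] [[0]]
```

The first two results agree, which shows the extra layer added nothing; only the version with ReLU behaves differently.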

compile

Now that we have our base model, we configure it for training with the compile() function. When calling it, we need to supply the following:

optimizer: The optimizer is the algorithm that updates the model’s weights to minimize the loss during training. I normally choose Adam as a starting point, then try out different optimizers during the hyperparameter tuning phase.

loss: We will use the built-in “categorical_crossentropy” loss for our multi-class classification problem; it expects one-hot encoded labels. For other problems, use the appropriate loss function to obtain the best result. You can also implement your own loss function if you wish, but that is a topic for another day.

metrics: These are the evaluation metrics that you would like to track during training. We select accuracy for classification problems.

callbacks: Notice that we did not use these in our training. Strictly speaking, callbacks are passed to fit() rather than compile(), but they are worth mentioning here. Callbacks are a really powerful API that lets you make the model perform certain actions during training. For example, we can set up early stopping so that training stops once the model starts to overfit, instead of running to the last epoch.
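Put together, the settings above could look like the sketch below. The tiny stand-in model, the patience value, and the monitored quantity are assumptions made so the snippet runs on its own; they are not taken from the notebook.

```python
import tensorflow as tf

# Small stand-in model so this snippet is self-contained (hypothetical sizes).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer="adam",                  # a common starting point
    loss="categorical_crossentropy",   # expects one-hot encoded labels
    metrics=["accuracy"],
)

# Callbacks are handed to fit(), not compile().
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=3,                  # assumed: stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch seen
)

# Example call, left commented out because it needs the dataset:
# model.fit(train_X, train_Y, validation_data=(test_X, test_Y),
#           epochs=30, callbacks=[early_stop])
```

If your labels are plain integers rather than one-hot vectors, "sparse_categorical_crossentropy" is the matching built-in loss.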

If you are interested in the math behind convolutional neural networks, I recommend CS231n from Stanford University[8]. I have also provided some other links in the reference area, so feel free to check them out. [9] [10]

Train the model

Now that we have created the model, let’s have a look at the model summary.

The model summary lists the parameter information for your model; we have 710,474 parameters to train. For now, you can ignore the non-trainable params. They mostly matter when you are doing transfer learning, where you take a pre-trained model and only train the part that is specific to your problem. This is achieved by freezing the pre-trained layers and training only the layers you add on top.

To start training the model, run the following command:
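The parameter total can be reproduced by hand. The sketch below assumes one architecture that happens to add up to 710,474 (a 3x3 convolution with 32 filters, 2x2 max pooling, then dense layers of 128, 128 and 10 units); the notebook may differ in detail.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return (inputs + 1) * units

conv = conv2d_params(3, 3, 1, 32)    # 28x28x1 input -> 26x26x32 output
flat = 13 * 13 * 32                  # after 2x2 pooling and flattening: 5408
dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 128)
output = dense_params(128, 10)

total = conv + dense1 + dense2 + output
print(conv, dense1, dense2, output, total)  # 320 692352 16512 1290 710474
```

Almost all of the parameters sit in the first dense layer, which is why pooling before flattening matters so much.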

hist = model.fit(x=train_X, y=train_Y, validation_data=(test_X, test_Y), batch_size=512, epochs=30)

We provide the training data and labels, the validation data and labels, and the batch_size and epochs. The training data is sliced into batches of batch_size samples, with one weight update per batch, and the whole dataset is iterated over once per epoch for the given number of epochs.
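To get a feel for these numbers, here is the arithmetic for the call above, assuming train_X holds the full 60,000-image Fashion MNIST training split.

```python
import math

num_samples = 60_000   # assumed: the standard Fashion MNIST training split
batch_size = 512
epochs = 30

steps_per_epoch = math.ceil(num_samples / batch_size)            # batches per epoch
last_batch = num_samples - (steps_per_epoch - 1) * batch_size    # final, smaller batch
total_updates = steps_per_epoch * epochs                         # weight updates overall

print(steps_per_epoch, last_batch, total_updates)  # 118 96 3540
```

A smaller batch size means more updates per epoch but slower epochs, so it is one more hyperparameter worth experimenting with.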

After 30 epochs our validation accuracy reached 89.7%. Not too shabby, but there is still a lot of room to improve the model. I will leave that to you to find a better solution to this problem.

Once we are done, we can save the model weights in h5 format. The next time you want to reuse the trained model, you just need to rebuild the same architecture and load the weights, and you can use it straight away without retraining.

model.save_weights('./my_cnn_model.h5', overwrite=True)

What’s Next?

Now that we have a base model that works relatively well for our problem, let’s talk about a few tips and tricks to fine-tune your model.

Notice that we only split the data once into training and validation sets. A better way to test the robustness of your model is to run training and evaluation multiple times with k-fold cross-validation [11][12]. The idea is similar to what we were doing, but instead of splitting once, we split the data into K folds and train K times, each time holding out a different fold for validation. By doing that, we get a more reliable estimate of how the model performs on different sets of image data.

You can also work out a confusion matrix to see which classes your model is struggling with. From it you can derive the false positive and false negative rates. Accuracy is not always the only thing you should be chasing. I will include some evaluation metrics[13] in the notebook, but I will not dive deep into this today.
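The splitting logic behind k-fold cross-validation can be sketched with the standard library alone. The helper name k_fold_indices is made up for this example; in practice a library routine such as scikit-learn's KFold does the same job.

```python
import random

def k_fold_indices(num_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(num_samples=10, k=5))
for train_idx, val_idx in splits:
    # Here you would build a fresh model, fit on train_idx, evaluate on val_idx.
    print(sorted(val_idx), len(train_idx))
```

Averaging the K validation scores gives a steadier estimate than a single split, at the cost of training K models.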
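As a small illustration, a confusion matrix and the per-class false positive and false negative rates can be computed like this. The labels and predictions are invented for the example.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_rates(matrix, cls):
    """Return (false positive rate, false negative rate) for one class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp
    fp = sum(row[cls] for row in matrix) - tp
    tn = total - tp - fn - fp
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fpr, fnr

y_true = [0, 0, 1, 1, 2, 2, 2, 1]   # made-up ground truth
y_pred = [0, 1, 1, 1, 2, 0, 2, 2]   # made-up predictions

cm = confusion_matrix(y_true, y_pred, num_classes=3)
fpr, fnr = per_class_rates(cm, cls=1)
print(cm)  # [[1, 1, 0], [0, 2, 1], [1, 0, 2]]
```

Off-diagonal cells show which pairs of classes get confused with each other, which is often more actionable than a single accuracy figure.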

How should I design my model for an image recognition problem?

Always start with a simple model, then add different layers (dropout layers, conv2d layers, batch normalization layers, etc.) to increase the complexity of your model. You can also base your model on an existing design such as LeNet, AlexNet, or VGGNet. Just take note that a complex model needs a large and varied enough dataset to support it, otherwise you will face overfitting. This is the common bias/variance trade-off, which we will not touch on today, but probably in the near future.

How to find the best hyperparameter for your model?

Once you have a model that you are happy with as a baseline, the next step is to fine-tune it. Here are a few ways to find good hyperparameters for your model.

Manually. You can definitely tune by manual trial and error. Pick some hyperparameters as a starting point, observe the result, then try new combinations to beat the existing ones.
GridSearch. List out a grid of hyperparameter values, then train on every single combination. This approach can be quite inefficient if your training time is long and you want to test many hyperparameter values.
RandomSearch. Set up a grid of hyperparameter values and select random combinations to train and score the model. The number of search iterations is set based on the time and resources available.
Bayesian Optimization. Unlike Grid Search and Random Search, Bayesian Optimization takes into account information from past trials to select parameters for future trials.
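The difference between grid search and random search can be sketched as below. The search space and the train_and_score function are hypothetical; the scoring function is a toy formula standing in for "train the model and return validation accuracy".

```python
import itertools
import random

# Hypothetical search space for illustration.
search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [128, 256, 512],
    "dense_units": [64, 128],
}

def train_and_score(params):
    """Toy stand-in for training a model and returning its validation accuracy."""
    return (
        0.9
        - abs(params["learning_rate"] - 1e-3)
        - abs(params["dense_units"] - 128) / 1000
    )

def grid_search(space):
    """Score every combination in the grid."""
    keys = list(space)
    candidates = [
        dict(zip(keys, values)) for values in itertools.product(*space.values())
    ]
    return max(candidates, key=train_and_score), len(candidates)

def random_search(space, n_iter, seed=0):
    """Score only n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=train_and_score), len(candidates)

best_grid, grid_trials = grid_search(search_space)
best_random, random_trials = random_search(search_space, n_iter=5)
print(grid_trials, random_trials)  # 18 5
```

Grid search needs 18 training runs here, while random search uses a fixed budget of 5, which is the trade-off between coverage and cost described above.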

How to speed up training?

We all know that training your model can take up a lot of your valuable time, so if you want to speed it up, you could consider the following:

GPU: In case you did not notice, you can train your model on your own GPU. It should work with both AMD and NVIDIA[14] cards once the required drivers are installed.

That said, if you want more computational power, you can always use Google Cloud Platform’s AI Platform to configure your notebook environment with an NVIDIA GPU. Just bear in mind that you may need to pay for your usage on Google Cloud.

TPU: If you do not own a GPU, another way is to rely on a TPU, which is available for free on Google Colab. You only get access to 1 core on the free tier, but in my experience it is still faster than training on your laptop’s CPU (about 3x faster in my case).

If you would like to try out the different hardware accelerators, you could start with Google Colab; it is actually pretty easy to set up. Open up the notebook settings, add a few lines of code, and you can train your model with either a GPU or a TPU on Google Colab.

A more advanced way of speeding up your training is to use distributed training.
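For the GPU case, a quick way to confirm that TensorFlow can see the accelerator is to list the physical devices. This is only a sanity check; TPU setup needs a few more lines and is not shown here.

```python
import tensorflow as tf

# Ask TensorFlow which GPUs it can see in the current runtime.
gpus = tf.config.list_physical_devices("GPU")

if gpus:
    print(f"TensorFlow can use {len(gpus)} GPU(s): {[gpu.name for gpu in gpus]}")
else:
    print("No GPU visible to TensorFlow; training will run on the CPU.")
```

If the list is empty on Colab, check that the runtime type in the notebook settings is set to a GPU.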

What if I have a more complex problem?

As you progress to more complex models, or when you only have a small dataset on hand, transfer learning can greatly increase the efficiency of model training. Here you use a pre-trained model (VGGNet, MobileNet, ResNet, etc.) [15] to solve your specific problem. Instead of retraining the whole model, you freeze the pre-trained convolutional layers so their weights are not updated, and train only the hidden layers or new layers you added on top [16]. This allows the new output layers to learn to interpret the features already learned by the pre-trained model.

You can also consider data augmentation[17] to increase the diversity of your dataset. Augmentations include rotation, zoom, crop, and much more; feel free to check out the TensorFlow documentation for more information.

Remember to always try out different things and come up with different approaches and architectures for the different problems you are trying to solve. For example, what should you do if you have exploding gradients? Try batch normalization. Machine learning is not always about building the model with the highest accuracy, but about finding the most suitable way to overcome the problem you are trying to solve.
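The freezing step can be sketched as follows. To keep the example small and free of downloads, a tiny untrained convolutional stack stands in for the pre-trained base; in a real project that role would be played by something like a MobileNet or ResNet loaded with pre-trained weights. All layer sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for a pre-trained convolutional base (hypothetical, untrained here).
base = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
])
base.trainable = False   # freeze: these weights are not updated during training

# New layers stacked on top of the frozen base; only these will be trained.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    base,
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Two dense layers contribute 4 trainable tensors; the frozen conv layer 2.
print(len(model.trainable_weights), len(model.non_trainable_weights))  # 4 2
```

The frozen weights are what show up as non-trainable params in the model summary.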
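To show what an augmentation does to the pixel grid, here are two simple transforms in plain Python. The function names are made up for this sketch; in a real pipeline you would normally use the augmentation utilities that ship with TensorFlow.

```python
def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

image = [
    [1, 2],
    [3, 4],
]

flipped = horizontal_flip(image)
rotated = rotate90(image)
print(flipped)  # [[2, 1], [4, 3]]
print(rotated)  # [[3, 1], [4, 2]]
```

Each transformed copy keeps the original label, so the model sees more variety without any new labelling work. Only use transforms that leave the label valid for your data.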

Conclusion

So in today’s article, we demonstrated a simple convolutional neural network for image recognition, along with some additional tips and tricks. I will continue to write more machine learning articles, and hopefully you will enjoy them as much as I do. I hope you learned something new today, and see you next time.

Please kindly find the final notebook link below: