Bored of MNIST? Let’s build your own OCR deep learning computer vision AI using Microsoft CNTK with EMNIST (Step by step guide)


Guest post by Chih Han Chen, Microsoft Student Partner from Imperial College London.

clip_image002

I am currently a second-year PhD student at Imperial College London. My research is mainly on expert systems and artificial intelligence for personalized decisions based on genetics. I am interested in the applications of informatics, big data, machine learning, data value chains and business modelling.

My LinkedIn profile link.

My GitHub for this project.

Overview

clip_image003

picture source from: [1]

Following my previous post, “Build your first deep neural network with Microsoft A.I. tool CNTK (Step by step guide)” from here, we now move on to something more advanced in deep learning. (If you have not read the previous post and have little background, I suggest you take a quick look at it first.) In this blog, we implement a computer vision model for optical character recognition (OCR) with a step-by-step guide. There are two main branches of deep neural networks: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Our model is a CNN built with CNTK and trained on a recent dataset called EMNIST (released April 2017). This blog is an implementation guide focusing on programming; the theory is covered, as far as possible, through links.

Introduction to OCR

Optical Character Recognition (OCR) is a field of research in pattern recognition, computer vision and artificial intelligence. It is used to extract text from scanned documents or photos.

clip_image004

picture source from: [2]

OCR is used to enter information from data records such as printed paper, handwriting, signs and photos. Among emerging applications, the smartphone camera is one of the most frequently used inputs, and OCR models are used as part of app functions. For example, the following figure shows a translation app that uses the camera.

clip_image006

Picture source from: [3]

Digitising the analogue world with electronic sensors, followed by electronic pre-processing, is usually required before computer vision recognition. For example, after taking a picture with a camera, the data usually has to be converted into a specific format before it is fed to our model. However, we can skip this step because we have a processed dataset, which is introduced in the next section.

Introduction to EMNIST dataset

The Extended Modified NIST (EMNIST) dataset, derived from NIST Special Database 19, is a set of handwritten digits [0-9] and letters [a-z] and [A-Z]. Images of those handwritten characters are converted to 28x28 pixels as shown in the figure below. The EMNIST dataset uses the “Resized and Resampled” format shown in (e) of the figure.

clip_image008

Picture source from: [4]

This dataset is designed as a more challenging replacement for MNIST that remains compatible with existing neural networks and systems. The paper introducing the dataset explains the conversion process used to create the images; see reference [1] under “Reference links”. Different parts of the dataset focus only on digits, or on small or capital English letters. In this blog we take on the most challenging task, which involves all digits (10), all small letters (26) and all capital letters (26), to create a realistic computer vision OCR for English.
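To make the 62-class label space concrete, here is a minimal sketch of a label-to-character lookup. It assumes the classes are ordered as digits, then capital letters, then small letters, which is how the EMNIST "by class" mapping is usually described; `CLASSES` and `label_to_char` are names made up for illustration, so please check the mapping file shipped with your download.

```python
import string

# Assumed class order: 10 digits, 26 capital letters, 26 small letters.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label):
    """Return the character for an integer class label in [0, 61]."""
    return CLASSES[label]
```

Under this assumption, label 10 is the capital letter A and label 36 is the small letter a.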

Setting your machine up

Before we jump to coding, let’s set your environment up.

Key things to be installed:

1. Python

2. openmpi-bin (for CNTK to work with your machine)

3. CNTK

4. Other packages (for this task)

For Python installation, check the reference for Windows or the reference for Linux.

Or you can install the Data Science Virtual Machine from Microsoft which has all these tools including CNTK preinstalled.

clip_image010

All the above tools and services are preinstalled on the Microsoft Data Science VM on Windows 2012, 2016, CentOS or Ubuntu

Learn more about the DSVM Webinar Link: https://info.microsoft.com/data-science-virtual-machine.html

More Product Information: Data Science Virtual Machine Landing Page
Community Forum: DSVM Forum Page

For openmpi-bin and CNTK, if you are using Linux you can follow either my GitHub guide or the official guide. If you are using Windows, please follow the official guide.

For other packages, check and install from the links: ‘matplotlib’, ‘numpy link1’, ‘numpy build from source’. Furthermore, some libraries such as ‘time’, ‘sys’, ‘os’, ‘__future__’, ‘urllib’, ‘zipfile’, ‘csv’ and ‘re’ are assumed to be built in. Note that, compared to the previous post, we use three more packages: ‘struct’, ‘shutil’ and ‘gzip’.
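If you are unsure whether everything is installed, a quick check like the one below may help. This is only a sketch, assuming Python 3; `missing_packages` and `REQUIRED` are hypothetical names, and the list should be adapted to the packages you actually need.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages mentioned in this post (edit the list as needed).
REQUIRED = ["numpy", "matplotlib", "struct", "shutil", "gzip", "zipfile", "csv", "re"]
print("Missing packages:", missing_packages(REQUIRED))
```

An empty list means all of the listed packages can be imported.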


Task for this blog

In this blog, we will first download the dataset and then train a CNN model to recognise English letters and digits. We will evaluate our model at the end of the post. You can simply download the code from my GitHub and execute it in the terminal by typing “python file_name.py”, or follow the step-by-step guide below.

Start Python by typing “python” in your terminal, or open Jupyter by typing “jupyter notebook”; a browser window will pop up, where you select “New -> Python”. In this blog the screenshots are from Jupyter notebook.


Data Download

Before we create our model, let’s first download the EMNIST dataset with the following code.

clip_image012

Then all the required packages are imported. If you get an error saying that a package does not exist, on a Linux machine you can simply open another terminal and type “sudo pip install (package name)” to install the package.

clip_image014

Here we define a function to report the progress of our download process.
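For readers who cannot see the screenshot, a progress callback could look roughly like the sketch below. It is not the exact code from this post; `progress_percent` and `report_progress` are illustrative names, and the three arguments follow the convention of the `reporthook` callback in Python 3's `urllib.request.urlretrieve`.

```python
import sys


def progress_percent(block_num, block_size, total_size):
    """Return the download progress in percent, or None if the size is unknown."""
    if total_size <= 0:
        return None
    return min(100.0, block_num * block_size * 100.0 / total_size)


def report_progress(block_num, block_size, total_size):
    """Print the progress on a single, continuously updated line."""
    percent = progress_percent(block_num, block_size, total_size)
    if percent is None:
        sys.stdout.write("\rDownloaded %d bytes" % (block_num * block_size))
    else:
        sys.stdout.write("\rDownloaded %5.1f%%" % percent)
    sys.stdout.flush()
```

A function with this signature can be passed as `reporthook=report_progress` when calling `urllib.request.urlretrieve`.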

clip_image016

The dataset is downloaded directly to the /tmp/ location with the above code. If you are using Windows, please replace /tmp/ with your desired location.
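One way to avoid editing the path by hand is to ask Python for the temporary folder of the current system. The snippet below is a small sketch of that idea; the folder name `emnist` and the variable `DATA_DIR` are assumptions for illustration, not names used in the original code.

```python
import os
import tempfile

# tempfile.gettempdir() usually resolves to /tmp on Linux and to the
# user's temporary folder on Windows.
DATA_DIR = os.path.join(tempfile.gettempdir(), "emnist")  # hypothetical folder name
os.makedirs(DATA_DIR, exist_ok=True)
print("Data will be stored in", DATA_DIR)
```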

clip_image018

This dataset is relatively large for Jupyter notebook, so a warning may pop up. Ignore this warning and wait! (Grab a coffee, maybe?)

clip_image020

After the download finishes, let’s unzip the files with the above lines.
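As a rough illustration of the decompression step, the sketch below unpacks a single .gz file with the `gzip` and `shutil` packages mentioned earlier. The helper name `gunzip` is made up here, and the actual file names depend on your download.

```python
import gzip
import shutil


def gunzip(src_path, dst_path):
    """Decompress a .gz file to dst_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```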

clip_image022

Here, we load the data and, at the same time, check the format.

clip_image024

We read out the labels and check the format.
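The format check can be sketched as follows, assuming the files use the same IDX binary layout as the original MNIST files: a big-endian header whose magic number is 2051 for images and 2049 for labels, followed by unsigned bytes. The function names `parse_idx_images` and `parse_idx_labels` are illustrative and are not the `loadData` and `loadLabels` functions of this post.

```python
import struct

import numpy as np


def parse_idx_images(raw):
    """Parse IDX image bytes into an array of shape (n, rows * cols)."""
    magic, n, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("Unexpected magic number for images: %d" % magic)
    pixels = np.frombuffer(raw, dtype=np.uint8, offset=16)
    return pixels.reshape(n, rows * cols)


def parse_idx_labels(raw):
    """Parse IDX label bytes into an array of shape (n, 1)."""
    magic, n = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError("Unexpected magic number for labels: %d" % magic)
    labels = np.frombuffer(raw, dtype=np.uint8, offset=8)
    return labels.reshape(n, 1)
```

With both arrays in hand, `np.hstack((images, labels))` gives one row per sample with the label in the last column.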

clip_image026

Finally, we define a try_readout function that performs loadData and loadLabels and stacks the results together.

clip_image028

After the function definitions, we specify the location of the data. Since we have already downloaded the files, the url is set to a path on the machine we are using. Note that if you are using Windows, you should change the path “/tmp/” to your own path.

clip_image030

You can check the datasets and labels with the above code. (The images are rotated and reversed by default.)
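If you would like to view the characters upright, a transpose is often enough. The sketch below assumes that the rotation plus mirroring mentioned above amounts to swapping rows and columns; `upright` is a hypothetical helper, and if your images still look wrong you may need `np.rot90` or `np.fliplr` instead.

```python
import numpy as np


def upright(flat_pixels, side=28):
    """Reshape a flat pixel row to (side, side) and swap rows with columns."""
    return np.asarray(flat_pixels).reshape(side, side).T
```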

clip_image032

clip_image034

After checking the data, we can save it in a txt format using the above code (this may take a while since the data is relatively large).
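To give an idea of what one line of such a text file may look like, here is a sketch that writes a one-hot label followed by the pixel values. The field names `labels` and `features` follow the convention of the CNTK MNIST tutorials and are an assumption here, as is the helper name `to_ctf_line`.

```python
def to_ctf_line(label, pixels, num_classes=62):
    """Format one sample as '|labels <one-hot> |features <pixel values>'."""
    one_hot = ["0"] * num_classes
    one_hot[label] = "1"
    feature_text = " ".join(str(int(p)) for p in pixels)
    return "|labels {} |features {}".format(" ".join(one_hot), feature_text)


# Tiny example with 4 classes and 2 pixels to keep the line short.
print(to_ctf_line(2, [0, 255], num_classes=4))
```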

Build our model

After downloading our dataset and saving it in a txt format, we are now ready to build and train our CNN model.

clip_image036

Import the required packages with the above code.

clip_image038

We define the input shape as 28x28 and the output as 62 nodes, which represent the 26 capital letters + 26 lower-case letters + 10 digits (0 to 9). Furthermore, we set the file paths for training and testing. In the following sections, we define functions for reading the data and creating the model.

clip_image040

For ease of later use, the reader is defined in advance with the above code, which describes the location of the data and how the reader function is used.

clip_image042

The model is created with the above code; it has two convolution layers and a dense layer for the output.

clip_image044

We name our model “z”. The input of the model is x, which has the shape (1,28,28), while y has the shape 62 (all letters and digits). With the following code we can print out the shapes of all layers within our CNN model.

clip_image046

Because we set the strides to (2,2) and the number of filters to 8 and 16, the first dimension changes from 1 to 8 to 16 between input and output, while the second and third dimensions change from 28 to 14 to 7. Finally, the last layer is fully connected to the 62 output nodes. To help understand the shapes and layers, see the figure below.

clip_image048

A good description of this can be found in the official CNTK tutorial or Here.

Be aware that the dimensions of each layer have to be integers; otherwise an error will occur.
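The shape arithmetic can be checked with a few lines of plain Python. The sketch below assumes padded convolutions, where the output side is the input side divided by the stride and rounded up; the 5x5 kernel used for the unpadded case is a hypothetical value chosen only to show how a non-integer size can arise.

```python
import math


def conv_output_size(size, stride, kernel=None):
    """Spatial output size of a strided convolution.

    kernel=None models a padded convolution: ceil(size / stride).
    Otherwise no padding is assumed: (size - kernel) / stride + 1,
    which may be fractional and therefore invalid.
    """
    if kernel is None:
        return math.ceil(size / stride)
    return (size - kernel) / stride + 1


shape = (1, 28, 28)
shapes = [shape]
for filters in (8, 16):
    side = conv_output_size(shape[1], stride=2)
    shape = (filters, side, side)
    shapes.append(shape)

print(shapes)  # [(1, 28, 28), (8, 14, 14), (16, 7, 7)]
print(conv_output_size(28, stride=2, kernel=5))  # 12.5, not a valid layer size
```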

Train our model

In the following sections we will first define the functions and then train our model.

clip_image050

With the above code, we use the built-in functions cross entropy with softmax and classification error to measure the loss and the error. The details of the functions can be found here.
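To show what these two measures compute, here is a NumPy sketch of the underlying formulas. It is not the CNTK implementation: it takes integer class labels rather than one-hot vectors, and the function names are simply chosen to mirror the concepts.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy_with_softmax(logits, labels):
    """Mean of -log(probability assigned to the true class)."""
    probs = softmax(logits)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())


def classification_error(logits, labels):
    """Fraction of samples whose highest-scoring class is not the true class."""
    return float((logits.argmax(axis=1) != labels).mean())
```

For a model that outputs identical scores for all 62 classes, the loss equals log(62), which is a useful reference value when reading the training log.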

clip_image052

The moving average is defined with the above function. For a detailed explanation please check here.
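A moving average smooths the noisy per-minibatch values so that the trend is easier to read. The sketch below uses one common definition, a trailing average over the most recent values; it may differ in detail from the function in the screenshot, and the window size of 5 is an assumption.

```python
def moving_average(values, window=5):
    """Trailing average over at most `window` of the most recent values."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged
```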

clip_image054

We report the training results of each minibatch with the above function.

clip_image056

Here we define the training function using built-in and previously defined functions. In the code above, a test step is included in the training function to report how well our model performs. Training proceeds batch by batch, which means we update our parameters each time based on a subset of our training data. Details of batch learning can be found here.
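The idea of updating parameters from one subset at a time can be illustrated without any deep learning library. The toy sketch below estimates the mean of a list of numbers with minibatch updates; `iterate_minibatches` and `fit_mean` are hypothetical names, and the learning rate and number of epochs are arbitrary choices.

```python
def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def fit_mean(samples, batch_size=2, learning_rate=0.5, epochs=20):
    """Toy example: move an estimate towards the mean of each minibatch."""
    estimate = 0.0
    for _ in range(epochs):
        for batch in iterate_minibatches(samples, batch_size):
            batch_mean = sum(batch) / len(batch)
            # One parameter update per minibatch.
            estimate += learning_rate * (batch_mean - estimate)
    return estimate
```

The CNN training follows the same pattern, with the estimate replaced by the network weights and the update computed from the loss gradient.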

clip_image058

We wrap up all steps with the above function. First the model is created with create_model. Then, after reading in the training and testing data with create_reader, the model is trained with train_test.

clip_image059

Start the model creation and training. We reach an average test error of 18.44%, which means 81.56% accuracy on average. Can we improve this?

clip_image061

Let’s introduce two max pooling layers into our CNN model and try again. Details about the max pooling layer can be found here.
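As a quick illustration of what a max pooling layer does, the NumPy sketch below keeps the largest value of every 2x2 block. The pool size and stride of 2 are assumptions for the example; the values used in the model above may differ.

```python
import numpy as np


def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a 2-D array with even height and width."""
    h, w = image.shape
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


sample = np.arange(16).reshape(4, 4)
print(max_pool_2x2(sample))  # [[ 5  7] [13 15]]
```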

clip_image062

The result shows a small improvement, to 82.03%. Since this classification task with the EMNIST dataset is much harder than tasks with simple MNIST, this accuracy is not surprising. However, it can be improved by playing around with the parameters or by introducing extensions or other architectures. Let’s leave our model at about 82% accuracy and dive into manual evaluation by looking directly into the dataset.

Manual Evaluation

Before we start looking into the output, let’s first add a softmax layer to our output.

clip_image064

clip_image066

We use the previously defined function to create a data reader for evaluation. Here, we load a subset of the data of size 25.

clip_image068

We load the data and reshape it into the proper input format (1, 28, 28).
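The reshaping step can be sketched with NumPy as follows. The helper name `to_model_input` is hypothetical, and converting to 32-bit floats is an assumption about what the model expects.

```python
import numpy as np


def to_model_input(flat_pixels):
    """Turn 784 flat pixel values into a (channels, height, width) array."""
    return np.asarray(flat_pixels, dtype=np.float32).reshape(1, 28, 28)
```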

clip_image070

We run prediction with our model in a for loop. The predicted results are stored in “pred”, while the correct results are loaded into “gtlabel”. Let’s take a look at the results with the print function.
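Comparing predictions with the ground truth boils down to taking the most probable class for each sample and counting the matches. The sketch below assumes the model output is an array with one row of class probabilities per sample; `predicted_labels` and `accuracy` are illustrative names.

```python
import numpy as np


def predicted_labels(probabilities):
    """Return the index of the most probable class for each row."""
    return np.argmax(probabilities, axis=1)


def accuracy(pred, gtlabel):
    """Fraction of predictions that match the ground-truth labels."""
    return float(np.mean(np.asarray(pred) == np.asarray(gtlabel)))
```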

clip_image072

It is also possible to show what the actual images look like with the following functions.

clip_image074

clip_image076


Summary and Extension:

The main purpose of this blog is to demonstrate the performance of a CNN model built with CNTK on the recent EMNIST dataset. In summary, we have demonstrated that the CNN model achieves at least 80% accuracy on the EMNIST dataset. This dataset can be used to train practical OCR tools for English letters and digits. In fact, a fun extension of this blog is to implement smartphone apps. You can even feed your own handwriting images to the trained model, following the last section, “Manual Evaluation”. Other neural network architectures such as RNN (LSTM) or CRNN, or even other branches of machine learning, can also be applied here to improve accuracy. Just try it out and have fun. Many other interesting projects can be found in the tutorials and examples on the official CNTK website.

Reference images:

[1] https://i2.wp.com/gacomputing.info/wp-content/uploads/2017/01/Cybernetics.jpg?resize=520%2C245&ssl=1

[2] https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Portable_scanner_and_OCR_%28video%29.webm/300px--Portable_scanner_and_OCR_%28video%29.webm.jpg

[3] https://solidgeargroup.com/wp-content/uploads/2016/08/google-translate-ucretsiz-ceviri-mobil-de-destekliyor.jpg

[4] https://www.jiqizhixin.com/data/upload/ueditor/20170223/58ae90277d7bd.png

Reference links:

[1] https://arxiv.org/pdf/1702.05373v1.pdf
