Bored of MNIST? Let’s build your own OCR deep learning computer vision AI using Microsoft CNTK with EMNIST (Step by step guide)

Guest post by Chih Han Chen, Microsoft Student Partner from Imperial College London.


I am currently a second-year PhD student at Imperial College London. My research focuses on expert systems and artificial intelligence for personalized decision-making based on genetics. I am interested in applications of informatics, big data, machine learning, the data value chain and business modelling.

My LinkedIn profile link.

My GitHub for this project.



Picture source: [1]

Following my previous post, “Build your first deep neural network with Microsoft A.I. tool CNTK (Step by step guide)” from here, we now move on to something more advanced in deep learning. (If you have not read the previous post and have little background, I suggest taking a quick look at it.) In this blog, we are going to implement a computer vision model for optical character recognition (OCR) with a step-by-step guide. There are two main branches of deep learning neural networks: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Our model is a CNN built with CNTK and trained on a recent dataset called EMNIST (released April 2017). This blog is an implementation guide focusing on programming; links to the underlying theory are provided wherever possible.

Introduction to OCR

Optical Character Recognition (OCR) is a field of research in pattern recognition, computer vision and artificial intelligence. It is used to capture text from scanned documents or photos.


Picture source: [2]

OCR serves as a way of entering information from data records such as printed paper, handwriting, signs and photos. Among emerging applications, the smartphone camera is one of the most frequently used inputs, and OCR models are used as part of app functions. For example, the following figure shows a translation app that uses the camera.


Picture source: [3]

Digitising the analogue world with electronic sensing, followed by electronic pre-processing, is usually required before computer vision recognition. For example, after taking a picture with a camera, the data usually has to be converted into a specific format before it is fed to the model. However, we can skip this step since we have a processed dataset, which is introduced in the next section.

Introduction to EMNIST dataset

The Extended Modified NIST (EMNIST) dataset, derived from NIST Special Database 19, is a set of handwritten digits [0-9] and letters [a-z] and [A-Z]. Images of those handwritten characters are converted to 28x28 pixels as shown in the figure below. The EMNIST dataset uses the “Resized and Resampled” format shown in (e) of the figure.


Picture source: here [4]

This dataset is designed as a more challenging replacement for MNIST that stays compatible with existing neural networks and systems. There is a paper introducing this dataset and explaining the conversion process used to create the images; see reference [1]. The dataset contains different parts that focus only on digits, or on lowercase or uppercase English letters. In this blog we take the most challenging task, which involves all digits (10), all lowercase (26) and uppercase (26) English letters, to create a realistic computer vision OCR for an English environment.

Setting your machine up

Before we jump to coding, let’s set your environment up.

Key things to be installed:

1. Python

2. openmpi-bin (for CNTK to work with your machine)

3. CNTK

4. Other packages (for this task)

For Python installation, check the reference for Windows or the reference for Linux.

Or you can install the Data Science Virtual Machine from Microsoft, which has all these tools, including CNTK, preinstalled.


All the above tools and services are preinstalled on the Microsoft Data Science VM on Windows 2012, 2016, CentOS or Ubuntu.

Learn more about the DSVM Webinar Link:

More Product Information: Data Science Virtual Machine Landing Page
Community Forum: DSVM Forum Page

For openmpi-bin and CNTK, if you are using Linux you can follow either my GitHub guide or the official guide. If you are using Windows, please follow the official guide.

For other packages, check and install from the links: ‘matplotlib’, ‘numpy link1’, ‘numpy build from source’. Furthermore, some libraries such as ‘time’, ‘sys’, ‘os’, ‘__future__’, ‘urllib’, ‘zipfile’, ‘csv’ and ‘re’ are assumed to be built in. Note that compared to the previous post we use three more packages: ‘struct’, ‘shutil’ and ‘gzip’.

Task for this blog

In this blog, we will first download the dataset and then train a CNN model to recognise English letters and numbers. We will evaluate our model at the end of the post. You can simply download the code from my GitHub and execute it in the terminal by typing “python”, or follow the step-by-step guide below.

Start Python by typing “python” in your terminal, or open Jupyter by typing “jupyter notebook”; a browser will pop up, where you select “New -> Python”. In this blog the screenshots are from Jupyter notebook.

Data Download

Before we create our model, let’s first download the EMNIST dataset with the following code.


Then all the required packages are imported. If you get stuck on an error saying a package does not exist, on a Linux machine you can simply open another terminal and type “sudo pip install (package name)” to install the package.


Here we define a function to report the progress of our download process.
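As a rough illustration of what such a progress function can look like, here is a minimal sketch. It assumes Python 3, where urllib.request.urlretrieve accepts a reporthook that is called with the block count, the block size and the total size. The names format_progress and report_progress are hypothetical and are not taken from the original code.

```python
import sys


def format_progress(block_count, block_size, total_size):
    """Return a one-line progress string for a download in flight."""
    downloaded = block_count * block_size
    if total_size <= 0:
        # The server did not report a size, so only the byte count is known.
        return "downloaded %d bytes" % downloaded
    percent = min(100.0, downloaded * 100.0 / total_size)
    shown = min(downloaded, total_size)
    return "downloaded %5.1f%% (%d of %d bytes)" % (percent, shown, total_size)


def report_progress(block_count, block_size, total_size):
    """Callback with the three-argument signature that reporthook expects."""
    sys.stdout.write("\r" + format_progress(block_count, block_size, total_size))
    sys.stdout.flush()
```

With this in place, the hook would be passed as urlretrieve(url, filename, reporthook=report_progress), where url and filename are whatever you use for the EMNIST archive.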


The dataset is downloaded directly to the /tmp/ location with the above code. If you are using Windows, please change the path to your desired location (replace /tmp/).


This dataset is relatively large for a Jupyter notebook, so a warning may pop up. Ignore this warning and wait! (grab a coffee maybe?)


After the download finishes, let’s unzip the files with the above lines.
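For readers who want to see the unzip step spelled out, here is a minimal sketch using only the standard library. It assumes the download is a .zip archive whose members are themselves .gz files, which is what the use of ‘zipfile’, ‘gzip’ and ‘shutil’ in this post suggests. The function names are hypothetical.

```python
import gzip
import shutil
import zipfile


def unzip_archive(zip_path, target_dir):
    """Extract every member of a .zip archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


def gunzip_file(gz_path, out_path):
    """Decompress a single .gz file without loading it all into memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Streaming with shutil.copyfileobj keeps memory use low, which matters for a dataset of this size.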


Here, we load the data and at the same time check the format.


We read out the labels and check the formats.
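To make the format check concrete, here is a minimal sketch of reading image and label files with ‘struct’ and ‘gzip’. It assumes the EMNIST binary files follow the same IDX layout as MNIST: a big-endian header with magic number 2051 for images and 2049 for labels, followed by raw unsigned bytes. The function names read_idx_images and read_idx_labels are hypothetical and differ from the loadData and loadLabels functions of this post.

```python
import gzip
import struct

IMAGE_MAGIC = 2051  # IDX magic number for 3-D unsigned-byte data (images)
LABEL_MAGIC = 2049  # IDX magic number for 1-D unsigned-byte data (labels)


def read_idx_images(path):
    """Return (rows, cols, images); each image is rows*cols raw pixel bytes."""
    with gzip.open(path, "rb") as f:
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != IMAGE_MAGIC:
            raise ValueError("not an IDX image file: magic=%d" % magic)
        size = rows * cols
        data = f.read(count * size)
    if len(data) != count * size:
        raise ValueError("image file is truncated")
    return rows, cols, [data[i * size:(i + 1) * size] for i in range(count)]


def read_idx_labels(path):
    """Return the labels as a list of integers."""
    with gzip.open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
        if magic != LABEL_MAGIC:
            raise ValueError("not an IDX label file: magic=%d" % magic)
        data = f.read(count)
    if len(data) != count:
        raise ValueError("label file is truncated")
    return list(data)
```

Checking the magic number and the byte count up front catches a wrong or partially downloaded file before any training time is spent on it.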


Finally, we define a try_readout function to perform loadData and loadLabels, while stacking them together.


After the function definitions, we specify the location of the data. Since we have already downloaded the files, the url is set to a path on the machine we are using. Note that if you are using Windows, please change the path “/tmp/” to your own path.


You can check the datasets and labels with the above code. (The images are rotated and mirrored by default.)
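If you want to view a character upright, one reading of “rotated and reversed” is that the rows and columns of each image are swapped. Under that assumption, which you should verify by plotting a few samples, a transpose restores the usual orientation. The helper below is a hypothetical sketch working on a flat, row-major list of pixels.

```python
def transpose_image(pixels, rows=28, cols=28):
    """Swap the row and column axes of a flat, row-major image."""
    if len(pixels) != rows * cols:
        raise ValueError("expected %d pixels, got %d" % (rows * cols, len(pixels)))
    # Output position (c, r) takes the value stored at input position (r, c).
    return [pixels[r * cols + c] for c in range(cols) for r in range(rows)]
```

Applying the function twice (with rows and cols exchanged the second time) returns the original image, so it is safe to experiment with.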



After checking the data, we can save it in a txt format using the above code (this may take a while since the data is relatively large).
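To give an idea of what one line of such a text file can look like, here is a minimal sketch that formats a sample in the CNTK text format, where each field starts with “|name” followed by its values. The field names “labels” and “features” and the helper name to_ctf_line are assumptions for illustration; use whatever names your reader is configured with.

```python
def to_ctf_line(label, pixels, num_classes=62):
    """Format one sample as '|labels <one-hot> |features <pixel values>'."""
    if not 0 <= label < num_classes:
        raise ValueError("label %d is outside 0..%d" % (label, num_classes - 1))
    one_hot = ["0"] * num_classes
    one_hot[label] = "1"
    features = " ".join(str(p) for p in pixels)
    return "|labels " + " ".join(one_hot) + " |features " + features
```

Each image then becomes a single line with 62 label values and 784 pixel values, which explains why the saved files are large.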

Build our model

After downloading our dataset and saving it in a txt format, we are now ready to build and train our CNN model.


Import the required packages with the above code.


We define the input shape as 28x28 and the output as 62 nodes, which represent the 26 uppercase letters + 26 lowercase letters + 10 digits (0 to 9). Furthermore, we set the file paths for training and testing. In the following sections, we define functions for reading the data and creating the model.
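The 62 output nodes can be mapped back to characters with a small lookup. The sketch below assumes the class order is digits first, then uppercase letters, then lowercase letters. This ordering is an assumption, so please check it against the mapping file shipped with the dataset before relying on it.

```python
import string

# Assumed class order: 0-9, then A-Z, then a-z (62 classes in total).
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase


def index_to_char(index):
    """Translate an output node index into the character it stands for."""
    return CLASSES[index]
```

For example, under this assumption node 10 stands for “A” and node 36 for “a”.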


For ease of later use, the reader is defined in advance with the above code, which describes where the data is located and how it is used.
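CNTK ships its own reader for text files, so you do not need to parse them yourself. Purely to illustrate what a reader extracts from each line and how samples are grouped into minibatches, here is a plain Python sketch. It assumes lines of the form “|labels ... |features ...”, and both function names are hypothetical.

```python
def parse_ctf_line(line):
    """Split one '|name v1 v2 ...' line into a dict of name -> list of floats."""
    fields = {}
    for chunk in line.strip().split("|"):
        parts = chunk.split()
        if parts:
            fields[parts[0]] = [float(v) for v in parts[1:]]
    return fields


def minibatches(lines, size):
    """Yield lists of parsed samples, each list holding at most `size` samples."""
    batch = []
    for line in lines:
        batch.append(parse_ctf_line(line))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # The final minibatch may be smaller than the requested size.
        yield batch
```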


The model is created with the above code and has 2 convolution layers and a dense layer for the output.


We name our model “z”. The input of the model is set to x, which has the shape (1,28,28), while y has the shape 62 (all letters and numbers). With the following code we can print the shapes of all layers within our CNN model.


Because we set the strides to (2,2) and the numbers of filters to 8 and 16, the first dimension changes from input to output as 1, 8, 16, while the second and third dimensions change as 28, 14, 7. Finally, the last layer is fully connected to the 62 output nodes. To help in understanding the shapes and layers, see the figure below.
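The shape arithmetic can be checked by hand with a few lines of Python. The sketch below assumes padded convolutions, where the spatial output size along one axis is the input size divided by the stride, rounded up. The function names are hypothetical and the code does not call CNTK.

```python
import math


def conv_output_size(size, stride):
    """Output size along one axis of a padded, strided convolution."""
    return math.ceil(size / stride)


def layer_shapes(input_shape, filters, stride=2):
    """List the (channels, height, width) shape after each convolution layer."""
    _, height, width = input_shape
    shapes = [input_shape]
    for num_filters in filters:
        height = conv_output_size(height, stride)
        width = conv_output_size(width, stride)
        shapes.append((num_filters, height, width))
    return shapes


# Two layers with 8 and 16 filters turn (1, 28, 28) into (8, 14, 14) and (16, 7, 7).
shapes = layer_shapes((1, 28, 28), [8, 16])
```

The last shape holds 16 x 7 x 7 = 784 values, which is what the dense layer connects to the 62 output nodes.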


A good description of this can be found in the official CNTK tutorial or here.

Be aware that the number of nodes in each layer has to be an integer; otherwise an error will occur.

Train our model

In the following sections we will first define the functions and then train our models.


With the above code, we use the built-in functions cross entropy with softmax and classification error to measure the loss and error. Details of these functions can be found here.
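To show what these two measures compute for a single sample, here is a plain Python sketch. It is not the CNTK implementation; it only follows the usual definitions, where the loss is log(sum_j exp(z_j)) - z_target and the error is 1 when the highest-scoring class is not the target. The inputs are the raw output values of the model and the index of the correct class.

```python
import math


def cross_entropy_with_softmax(logits, target_index):
    """Loss for one sample: log(sum_j exp(z_j)) - z_target, computed stably."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]


def classification_error(logits, target_index):
    """Return 0.0 if the highest-scoring class is the target, else 1.0."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return 0.0 if predicted == target_index else 1.0
```

An untrained model that scores all 62 classes equally has a loss of log(62), roughly 4.13, which is a useful reference point when reading the first training reports.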


A moving average is defined with the above function. For a detailed explanation please check here.
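As an illustration of the idea, here is one possible moving average in plain Python. It averages each value with the values just before it, up to a fixed window. This is a sketch of the general technique, and the function in this post may treat details such as the first few values differently.

```python
def moving_average(values, window=5):
    """Average each value with up to window-1 values that precede it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that has just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

Smoothing the per-minibatch loss this way makes the downward trend easier to see, since single minibatches can be noisy.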


We report the training process results of each minibatch with the above function.


Here we define the training function using built-in and previously defined functions. In the code above, a test step is included in the training function to report how well our model performs. Training proceeds minibatch by minibatch, which means we update our parameters each time based on a subset of our training data. Details of batch learning can be found here.
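The idea of updating parameters from one subset at a time can be shown on a toy problem that is much smaller than our CNN. The sketch below fits a single weight w so that w * x matches y, using the mean squared error and plain gradient descent on minibatches. All names and numbers are hypothetical and unrelated to the CNTK trainer.

```python
def minibatch_sgd(xs, ys, batch_size, learning_rate, sweeps):
    """Fit y = w * x by updating w once per minibatch."""
    w = 0.0
    for _ in range(sweeps):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the mean squared error over this minibatch only.
            grad = sum(2 * (w * x - y) * x
                       for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= learning_rate * grad
    return w


# The data follows y = 3 * x, so the fitted weight should approach 3.
fitted = minibatch_sgd([1, 2, 3, 4], [3, 6, 9, 12],
                       batch_size=2, learning_rate=0.05, sweeps=50)
```

Each sweep over the four samples performs two updates here, one per minibatch, instead of a single update computed from all the data.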


We wrap up all steps with the above function. First the model is created with create_model. Then, after reading in the training and testing data with create_reader, the model is trained with train_test.


Start executing the model creation and training. We reached an average test error of 18.44%, which means 81.56% accuracy on average. Can we improve this?


Let’s try again after introducing two max pooling layers into our CNN model. Details about the max pooling layer can be found here.
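For readers new to max pooling, here is a tiny plain Python sketch of the operation on a single channel. It assumes a 2x2 window with stride 2; the post does not state the pooling parameters of the model, so treat these values as an example only.

```python
def max_pool_2x2(image):
    """Keep the largest value of every non-overlapping 2x2 block."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]


pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 1, 2, 3],
                       [4, 5, 6, 7]])
```

The 4x4 input shrinks to 2x2 while the strongest response in each block is kept, which makes the features less sensitive to small shifts of the character.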


The result shows a slight improvement to 82.03%. Since this classification task on the EMNIST dataset is much harder than tasks on simple MNIST, this accuracy is not surprising. However, it can be improved by tuning the parameters or introducing extensions or other architectures. Let’s leave our model at 8X% accuracy and dive into manual evaluation by looking directly at the dataset.

Manual Evaluation

Before we start looking into the output, let’s first add a softmax layer onto our output.
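The softmax turns the raw output values of the 62 nodes into probabilities that sum to 1, so the largest one can be read as the predicted class. A plain Python sketch of the standard formula follows; it is for illustration only and is not the CNTK layer itself.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
predicted = probs.index(max(probs))  # index of the most likely class
```

Because softmax keeps the order of the scores, the predicted class is the same with or without it; what it adds is a probability that is easier to interpret.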



We use the previously defined function to create a data reader for evaluation. Here, we load a subset of the data of size 25.


We load the data and change the shape into the proper input format (1, 28, 28).


We run prediction with our model in a for loop. The predicted result is stored in “pred”, while the correct result is loaded into “gtlabel”. Let’s take a look at the results with the print function.


It is also possible to show what the actual images look like with the following functions.



Summary and Extension:

The main purpose of this blog is to demonstrate the performance of a CNN model built with CNTK on the recent EMNIST dataset. In summary, we have shown that the CNN model achieves at least 80% accuracy on EMNIST. This dataset can be used to train practical OCR tools for English letters and numbers. A fun extension of this blog is to build smartphone apps, or you can even feed your own handwriting images to the trained model by following the last section, “Manual Evaluation”. Other neural network architectures such as RNN (LSTM) and CRNN, or even other branches of machine learning, can also be applied here to improve accuracy. Just try it out and have fun. Many other interesting projects can be found in the tutorials and examples on the official CNTK website.

Reference images:





Reference links:

