Computer Vision Workshop using the Microsoft Computer Vision API and Cognitive Toolkit (CNTK) + Keras

Guest post from Shu Ishida, Microsoft Student Partner team lead at the University of Oxford


My name is Shu Ishida. I am a third-year engineering student and team lead for the Microsoft Student Partner Chapter at the University of Oxford. I am into web development and machine learning, and my ambition is to help break down educational, cultural and social barriers through technological innovation. My current interest is in applications of Artificial Intelligence to e-Learning. I am also passionate about outreach and teaching, and have mentoring experience at summer schools as an Oxford engineering student ambassador.


Workshop summary

On 16th February, the Oxford Microsoft Student Partners ran a Computer Vision Workshop at Christ Church college. The audience ranged from beginners to more experienced programmers. We gave out detailed handouts so that people could follow along or look ahead according to their level of experience.

In the workshop, we discussed three methods of image classification:

1. Microsoft Computer Vision API

2. Microsoft Custom Vision Services

3. Cognitive Toolkit + Keras.

In this blog post, I’m going to go over the topics that were covered in the workshop.

You can find a copy of the full handout and code in my GitHub repository.


What is an API?

An API (Application Programming Interface) defines how your laptop and the server communicate. It is an interface between the laptop and the server, much as a remote control is an interface between the user and the TV. To use the service, you just have to press the right button.

If you send the API an image you want classified, it sends back the information you asked for.


Now let’s draw a slightly different analogy. Communication between a laptop and a server is like sending a letter by post, except that it is much faster. Like a letter, the information sent has two parts – the header and the body. The header is like the envelope, containing information about where the letter is to be delivered. Instead of calling it a delivery address, we call this an endpoint URL.


Often you also pass details about the data you are sending, or specifications of the desired response, as parameters. For example, if you are passing soundwave data to a speech recogniser, you want to tell it which language the speech is in, the sampling frequency of the soundwave, and so on. You also specify what kind of response you want (whether you are just interested in the transcription, or also want alternatives, confidence scores, timestamps, etc.).

Unlike humans, who can read a letter without being told whether it is written in English, French or some other language, computers must be told which format the content is in. JSON is a popular format that is convenient for passing data (XML is another format that used to be more common). We therefore set the Content-Type header to application/json.

You also need to include the subscription key that you get when you sign up for the API service. This is required because most companies don’t offer their services for free. Once you start using APIs and cloud services, you will have tons of subscription keys for different services. Make sure you never share your subscription keys online in places like GitHub or YouTube.

Last but not least, you put the data you want processed in the body (the content). For this workshop that is the image you want to classify, but it could be soundwaves, video URLs or other data depending on the API you are using. The body is formatted in the format we specified in the header, in this case JSON.
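To make the envelope-and-letter picture concrete, here is a minimal sketch that assembles such a request in Python without sending it. The endpoint URL and key are placeholders, and the header name Ocp-Apim-Subscription-Key and the visualFeatures parameter are assumptions based on how Azure Cognitive Services are usually called, so check them against the documentation for your own subscription.

```python
import json
import urllib.parse
import urllib.request

# Placeholder values for illustration; use the endpoint and key from your own subscription.
ENDPOINT_URL = "https://example.invalid/vision/analyze"
SUBSCRIPTION_KEY = "your-subscription-key"


def build_request(image_url, visual_features="Categories,Description"):
    """Assemble (but do not send) a POST request for an image-analysis API."""
    # Parameters describe the response we want and travel in the query string.
    query = urllib.parse.urlencode({"visualFeatures": visual_features})
    # The header is the "envelope": content format plus the subscription key.
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
    }
    # The body is the "letter": here, JSON pointing at the image to classify.
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL + "?" + query, data=body, headers=headers, method="POST"
    )


request = build_request("https://example.invalid/street.jpg")
print(request.get_method(), request.full_url)
print(request.get_header("Content-type"))
```

Nothing is sent over the network here; passing the request to urllib.request.urlopen (or using the requests library instead) would perform the actual call.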

Getting Started with Computer Vision API

Computer Vision API is one of Azure Cognitive Services. Follow the link and click ‘Try Cognitive Services for free’, and you will find a button to get the API key.

Have a look through the documentation. We will start with the Python documentation, which walks you through the details of every line of the code. If you want to get to the code quickly, clone (or download) my GitHub repository and open the script for this section. Set the variable ‘subscription_key’ to the API key for the Computer Vision API that you have just obtained.

Now run the code. If you see an output that looks like this, you are properly communicating with the API!

{'categories': [{'name': 'outdoor_', 'score': 0.00390625, 'detail': {'landmarks': []}}, {'name': 'outdoor_street', 'score': 0.33984375, 'detail': {'landmarks': []}}], …
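Once the response arrives it is just a nested Python dictionary, so you can pick out whatever you need. As a small sketch, assuming the response has the shape shown in the output above (only the two visible categories are reproduced here, and best_category is a helper name made up for this example):

```python
# Sample shaped like the output above, which is truncated there.
analysis = {
    "categories": [
        {"name": "outdoor_", "score": 0.00390625, "detail": {"landmarks": []}},
        {"name": "outdoor_street", "score": 0.33984375, "detail": {"landmarks": []}},
    ]
}


def best_category(result):
    """Return the (name, score) of the highest-scoring category, or None."""
    categories = result.get("categories", [])
    if not categories:
        return None
    top = max(categories, key=lambda category: category["score"])
    return top["name"], top["score"]


print(best_category(analysis))
```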


Sort out your cluttered photo collection

Image recognition is great if you don’t want to spend hours labelling your photos or searching for the one you want. In this section, we’ll see how to go through locally stored images (photos on your laptop) and label them automatically. The script loops through the files in the directory you specify and calls async def image_CV(data) whenever it finds an image. Note that the function now receives the image data rather than the image URL. The data is in a computer-readable binary format. According to the documentation, binary image data must be passed in via the data parameter as opposed to the json parameter. You also have to indicate that the content of the request is binary data, so set 'Content-Type': 'application/octet-stream'.
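The two ideas in this section, finding the image files and sending raw bytes instead of JSON, can be sketched as follows. This is a synchronous illustration rather than the async code in the repository; the function names, the list of extensions and the Ocp-Apim-Subscription-Key header name are assumptions made for the example.

```python
import os
import urllib.request

# Assumed list of extensions; adjust it to the formats you actually use.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def find_images(directory):
    """Return sorted paths of files in `directory` that look like images."""
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        extension = os.path.splitext(name)[1].lower()
        if os.path.isfile(path) and extension in IMAGE_EXTENSIONS:
            paths.append(path)
    return paths


def build_binary_request(endpoint_url, subscription_key, image_path):
    """Assemble (but do not send) a request whose body is the raw image bytes."""
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
    headers = {
        # Tells the server the body is binary data, not JSON.
        "Content-Type": "application/octet-stream",
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    return urllib.request.Request(
        endpoint_url, data=image_data, headers=headers, method="POST"
    )
```

A caller would loop over find_images(folder) and build one request per file, sending each with urllib.request.urlopen or an equivalent HTTP client.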


Run the Python code. You now have to pass an additional argument to indicate which folder contains the images you want analysed. I called this folder images, so I can run “python .\ .\images\”. Feel free to add more images to this folder to get more responses. It might take a while if an image is large.

Export your results to a CSV file

A Comma-Separated Values (CSV) file is a useful, lightweight representation of tabular data. You can open it in Excel, and it is widely compatible with other spreadsheet applications. You will see CSVs used in machine learning and other data analytics as well. The sample code shows you how to create a CSV file. Once you have run it and are satisfied, open the album_to_csv programme, which exports the results you got in the previous section to a CSV file named ‘album.csv’. It extracts the caption and tags and stores them along with the image file path.
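As a rough sketch of the export step, Python's built-in csv module is enough. The column names and the sample rows below are made up for illustration; the code in the repository may organise things differently.

```python
import csv
import io


def write_album_csv(rows, csv_file):
    """Write (path, caption, tags) rows to an open text file as CSV."""
    writer = csv.writer(csv_file)
    writer.writerow(["file_path", "caption", "tags"])
    for path, caption, tags in rows:
        # Tags are joined into one cell; the csv module quotes the commas for us.
        writer.writerow([path, caption, ", ".join(tags)])


# Made-up results for illustration.
results = [
    ("images/street.jpg", "a city street", ["outdoor", "road", "building"]),
    ("images/cat.jpg", "a cat on a sofa", ["cat", "indoor"]),
]

buffer = io.StringIO()
write_album_csv(results, buffer)
print(buffer.getvalue())

# To write a real file instead:
# with open("album.csv", "w", newline="", encoding="utf-8") as csv_file:
#     write_album_csv(results, csv_file)
```

Opening the file with newline="" is what the csv module's documentation recommends, as it avoids blank lines between rows on Windows.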

Text recognition – make your hand-written notes permanent!

Probably the most useful part of the technology for us students is text recognition. Amazingly, the Computer Vision API not only transcribes typed text but also recognises handwritten text, even cursive writing (provided that it is legible to humans…)!

This part of the documentation talks about how to use the text recognition capability of the API. The tricky part is that, unlike the previous implementations, the text recognition service does not return the recognized text by itself. Instead, it returns an "Operation Location" URL. We have to call this URL separately once the recognition result is ready.
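The "ask again later" pattern described above can be sketched as a small polling loop. The status values ("Running", "Succeeded", "Failed") and the fetch_json callback are assumptions for this example: fetch_json stands for whatever function you write to GET the Operation Location URL with your subscription key and return the parsed JSON.

```python
import time


def wait_for_text(operation_url, fetch_json, interval_seconds=1.0, max_attempts=30):
    """Poll `operation_url` until recognition succeeds and return the final JSON."""
    for _ in range(max_attempts):
        result = fetch_json(operation_url)
        status = result.get("status")
        if status == "Succeeded":
            return result
        if status == "Failed":
            raise RuntimeError("Text recognition failed")
        # Not ready yet: wait a little before asking again.
        time.sleep(interval_seconds)
    raise TimeoutError("Text recognition did not finish in time")


# Stand-in for a real HTTP GET, so the sketch runs offline.
fake_replies = iter([
    {"status": "Running"},
    {"status": "Succeeded", "recognitionResult": {"lines": []}},
])
final = wait_for_text(
    "https://example.invalid/operation/123",
    lambda url: next(fake_replies),
    interval_seconds=0,
)
print(final["status"])
```

Capping the number of attempts keeps a script from hanging forever if the service never reports a final status.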

Run the script. The response contains a lot of information about the location of the text, which is useful for reconstructing the layout of the text on the 2D image. Here, we use matplotlib to overlay the text onto the original image. Matplotlib is a well-known library for drawing different kinds of graphs and images.
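To show how the location information can be used, the sketch below pulls out one anchor point per recognised line and sorts the lines into reading order. It assumes the result has a recognitionResult.lines list in which each line carries a text string and a boundingBox of eight numbers starting at the top-left corner; verify that shape against the response you actually receive.

```python
def lines_in_reading_order(result):
    """Return (x, y, text) per recognised line, sorted top-to-bottom then left-to-right."""
    lines = result.get("recognitionResult", {}).get("lines", [])
    placed = []
    for line in lines:
        # boundingBox is assumed to be [x1, y1, x2, y2, x3, y3, x4, y4],
        # starting at the top-left corner.
        x, y = line["boundingBox"][0], line["boundingBox"][1]
        placed.append((x, y, line["text"]))
    return sorted(placed, key=lambda item: (item[1], item[0]))


# Made-up response for illustration.
sample = {
    "status": "Succeeded",
    "recognitionResult": {
        "lines": [
            {"boundingBox": [40, 120, 300, 120, 300, 150, 40, 150], "text": "second line"},
            {"boundingBox": [40, 60, 260, 60, 260, 90, 40, 90], "text": "first line"},
        ]
    },
}

for x, y, text in lines_in_reading_order(sample):
    print(x, y, text)
```

Each (x, y, text) triple can then be drawn on top of the image, for example with matplotlib's Axes.text(x, y, text).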


Mini-challenge: transcribe your lecture notes

For experienced programmers, now is the time for a hack! Similar to what we did in the album_to_csv programme, write a programme that transcribes all the pictures of your lecture notes / whiteboards and exports the text into text files. Maybe you can name each text file according to the date and time of the lecture, or you can even extract the title of the slide from the picture.
If that is not enough, you can look into indenting and styling the text according to the position and size of each of the boxes containing a block of text.

When to choose Custom Vision Services

So far, we have explored different features of Computer Vision API, which has been trained to suit high-demand purposes such as labelling objects and extracting text. There are other features to it as well that we haven’t explored, such as identifying celebrities and landmarks.

However, sometimes this general-purpose API isn’t enough to serve one’s needs. Maybe you want to build a classifier that is trained to identify subtle features, or to return a very detailed classification (a tag ‘plant’ is not detailed enough if you want to classify different species of plants). That is when Custom Vision Services will be useful.

Custom Vision Services is by far the easiest way to train your own classifier. You just have to upload a dozen or so images and label them according to how you want them classified. That’s really it! You immediately have a trained model that can reach above 90% precision and recall, figures you can examine visually.

Getting Started with Custom Vision Services

I would walk you through the steps, but there is already an easy-to-follow tutorial prepared by Microsoft. Cloning its GitHub repository is slow since everything is under one massive repository, so I’ve included the necessary files in my repository. Once you’ve followed each step of the tutorial, you should have a running app.


In fact, this tutorial is also a great introduction to Node.js, a runtime for running JavaScript outside the browser, for example on a server. What is more, you can create a desktop application with a framework called Electron, which we use in this tutorial. No more hacking around GUI and window objects in C# - this alternative gives you a very quick-to-build solution that lets you develop desktop apps using just HTML, CSS and JavaScript.

Now you are set to create your own image classifier on Custom Vision Services.

Classification with frameworks

Sometimes you want to have the fullest control over your model, changing the number or size of layers in your Neural Networks, or detecting latent features. Or maybe you just want to have a good enough classifier for a product without having to pay for the API subscription.

This is when you want to start building your own Neural Network model. Thankfully, we don’t have to implement Neural Networks from zero. There are many libraries for machine learning and Neural Networks available which we could use.

Machine Learning made easy by frameworks

Many machine learning frameworks have been developed by large companies. Caffe, CNTK, TensorFlow and Torch are some of the popular frameworks available. Cognitive Toolkit (CNTK), developed by Microsoft, is known for scaling well across multiple GPUs and is a go-to framework for distributed computing, while TensorFlow, developed by Google, has strong community support and resources.


Using these frameworks has significant advantages:

1. You don’t have to know the mathematical details of Neural Networks, which involve a lot of calculus and linear algebra. You can just decide on the structure of the Neural Network you want to build, and the framework will take care of implementing the algorithm.

2. Frameworks can achieve better computation speed, since your model is compiled and run in a machine-friendly format. They typically convert a series of vector and matrix operations into what is called a computation graph, which is then executed by code written in C++ or some other low-level language. Frameworks like CNTK and TensorFlow also take advantage of GPUs, which helps increase the training speed of the model.

3. Because you are not writing basic low-level functions but can focus on the high-level implementations, you can quickly test your ideas and achieve a faster development loop.

Installing CNTK / TensorFlow

How you install CNTK or TensorFlow on your machine depends on your OS and environment, but it can generally be done with a simple pip installation. See: How to install CNTK / How to install TensorFlow

Introduction to Keras

CNTK and TensorFlow are great, but they are still complicated frameworks for beginners. To make things easier, we can use a front-end framework called Keras, a user-oriented framework that helps you build Neural Networks with minimal coding. It works with CNTK, TensorFlow or Theano (another ML framework) as a backend, and runs models using the backend’s capabilities.

The great thing about Keras is that you can test out the same Neural Network on different backends by simply modifying one line in the configuration. See this documentation to learn how to swap your backend.
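As an illustration of that one-line change, multi-backend versions of Keras read a "backend" entry from a JSON file, by default at ~/.keras/keras.json. The helper below is a hypothetical convenience function written for this post, not part of Keras; it updates that entry and leaves any other settings in the file alone.

```python
import json
import os


def set_keras_backend(backend, config_path=None):
    """Write `backend` into a Keras JSON config file, keeping other settings."""
    if config_path is None:
        # Default location assumed for multi-backend Keras.
        config_path = os.path.join(os.path.expanduser("~"), ".keras", "keras.json")
    config = {}
    if os.path.exists(config_path):
        with open(config_path, encoding="utf-8") as config_file:
            config = json.load(config_file)
    config["backend"] = backend  # e.g. "cntk", "tensorflow" or "theano"
    directory = os.path.dirname(config_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(config_path, "w", encoding="utf-8") as config_file:
        json.dump(config, config_file, indent=4)
    return config
```

Calling set_keras_backend("cntk") before importing Keras would switch the backend for later runs. Setting the KERAS_BACKEND environment variable is an alternative that does not touch the file.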

Image recognition with Keras

I would like to give credit to Greg Chu for this blog article on how to build an image classifier using Keras. Go to the GitHub repository and clone it onto your machine. The chapter on image recognition essentially addresses the problem we solved using the Microsoft Computer Vision API earlier in the workshop. This Neural Network downloads parameters already trained on the ImageNet dataset (a large dataset of labelled images used for general computer vision training and validation purposes) so that it gives us predicted labels without us having to train the model ourselves.

Similar to the Computer Vision API, this code allows us to feed the image data either as binary data or by passing the image URL.


The second chapter discusses the concept of transfer learning. Training an image classification model from scratch for every custom model we build could take weeks before it can identify useful features for classifying objects. It is far more efficient and reliable to use pre-trained model parameters for the earlier layers of your Neural Network, and then append some custom trainable layers at the very end to train the model for your specific purpose.

This is the secret behind why Microsoft Custom Vision Services achieves such high prediction accuracy with just 6 or 7 training images and a short training time. It already has pre-trained Neural Network layers and customises the network to your training dataset using the last few layers.
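The idea of freezing the early layers and training only the last one can be shown with a deliberately tiny, pure-Python toy. This is not the Keras code from the article, nor how Custom Vision is implemented: frozen_features is a made-up stand-in for pre-trained layers, and only a single logistic output layer is trained on four example points.

```python
import math


def frozen_features(point):
    """Stand-in for pre-trained layers: fixed, never updated during training."""
    x0, x1 = point
    return [x0, x1, x0 * x1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(points, labels, epochs=200, learning_rate=0.5):
    """Fit only the final logistic layer on top of the frozen features."""
    features = [frozen_features(p) for p in points]
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
        for feature, label in zip(features, labels):
            score = sum(w * f for w, f in zip(weights, feature)) + bias
            error = sigmoid(score) - label
            grad_w = [g + error * f for g, f in zip(grad_w, feature)]
            grad_b += error
        # Gradient descent step on the head only; the features stay fixed.
        weights = [w - learning_rate * g / len(points) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(points)
    return weights, bias


def predict(point, weights, bias):
    score = sum(w * f for w, f in zip(weights, frozen_features(point))) + bias
    return 1 if score > 0 else 0


# An XOR-like problem: not separable from the raw inputs alone,
# but easy once the frozen x0 * x1 feature is available.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, 0, 0]
weights, bias = train_head(points, labels)
print([predict(p, weights, bias) for p in points])
```

Because the useful feature already exists in the frozen part, four examples are enough to train the head, which mirrors why a handful of images can be enough when the earlier layers are pre-trained.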

Closing remark

We have covered different methods and uses of image recognition. If you are developing an app, it is great to know that you can add computer vision to it in about five minutes, just by using APIs or frameworks. This makes it easy to build a prototype, enabling you to test and improve your idea rapidly.


Future events

We will be organising more tech events later this term and in the next academic year. Please follow our Facebook page for more information.
