Every year, second-year Computer Science students at the University of Cambridge take part in a group project with an industrial client. This is a guest post from Team Alpha, who worked with Microsoft on their group project.
Team Alpha at the project demonstration fair
The government currently publishes large amounts of open data online, making it available to everyone, but not necessarily useful to everyone. Our task was to investigate how the Azure platform can be used to make air quality data truly open and accessible to the general public.
We are a team of six students from the University of Cambridge, and to achieve this goal we developed Air Quality Radar. Air Quality Radar applies the concepts of radar maps and weather forecasting to air quality. We aimed to use only open data from the government and the Met Office to create an interactive way of exploring the air quality in Cambridge in the past, present, and future.
During the initial phase of the project, we decided to build three main components: a web front-end centred on a radar map of Cambridge, a set of machine learning models to enable air quality forecasting, and an API to interface between the two, which also makes data querying easier for researchers interested in accessing the raw data. We were able to make full use of the Azure platform for these components: we built our models and created our predictions with Azure Machine Learning, and we hosted our website and API on Azure’s App Service platform.
Using Azure Machine Learning, we were able to quickly develop and evaluate simple machine learning models on a team-based platform. The user interface is easy for beginners to use, and there are many example projects on the Cortana Intelligence Gallery to start from. A number of machine learning modules are already implemented on the platform and can be used directly, including commonly used classification and regression algorithms. There are certain limits on the customisability of these modules, but these can be overcome by using the Python and R script modules, which allow users to implement tailor-made algorithms.
During development, we also used GitHub and Travis to ensure that pull requests were tested and code reviewed before they could be checked in. This helped to maintain productivity by making sure the master branch always worked. The Azure App Service platform made it easy to automatically deploy the new version of our code straight from Travis when pull requests were checked in.
In order to forecast air quality, we developed and tested several machine learning models, and also collected and cleaned several sources of open data for training and testing. At the beginning, we started by feeding the data in directly and training several modules to gain some experience with Azure ML.
In the very first models that we built, we tried to predict air quality based solely on weather and meteorological data, performing a random split on our data and using default settings for the machine learning algorithms. The resulting models did not perform particularly well, and some even had negative coefficients of determination (meaning that the models predicted the data worse than simply using the mean of the observed values).
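To make the meaning of a negative score concrete, here is a minimal sketch of the coefficient of determination computed by hand. The readings and predictions are made-up illustrative numbers, not data from the project, and `r_squared` is a hypothetical helper rather than the Azure ML evaluation module.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical pollutant readings (illustrative values only).
actual = [10.0, 20.0, 30.0, 40.0]

# Baseline: always predict the mean of the observed values.
mean_baseline = [25.0] * len(actual)

# A poor model whose errors are larger than the baseline's.
poor_model = [40.0, 10.0, 45.0, 5.0]

baseline_score = r_squared(actual, mean_baseline)  # exactly 0 by definition
poor_score = r_squared(actual, poor_model)         # below 0: worse than the mean

print(baseline_score, poor_score)
```

A score of 1 is a perfect fit, 0 matches the mean baseline, and anything below 0 means the squared error is larger than that of the baseline.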
We soon realised that this poor performance might be caused by non-normalised data, poorly chosen parameters, and “dirty” input data containing a high number of anomalies. We spent time finding ways to clean the data and using automated tools for parameter tuning. We also wrote custom Python scripts to expand our dataset with historical data columns, so that important factors such as the maximum pollution level of the previous day could be used by the model. After discarding unnecessary weather factors and cleaning the dataset, we developed models with satisfactory performance without too much dependence on historical data.
One of the machine learning training experiments. Each pipeline corresponds to one of the pollutants
The performance of one of our hourly prediction models
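The historical columns mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script we ran on Azure ML: the row layout, the field names (`date`, `hour`, `no2`, `prev_day_max`) and the sample values are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def add_previous_day_max(rows, value_key="no2"):
    """Return copies of the rows with a 'prev_day_max' column added.

    Rows whose previous day has no readings get None, so they can be
    dropped or imputed before training.
    """
    # First pass: find the maximum reading for each calendar day.
    daily_max = defaultdict(lambda: float("-inf"))
    for row in rows:
        daily_max[row["date"]] = max(daily_max[row["date"]], row[value_key])

    # Second pass: attach the previous day's maximum to every row.
    enriched = []
    for row in rows:
        previous_day = row["date"] - timedelta(days=1)
        enriched.append({**row, "prev_day_max": daily_max.get(previous_day)})
    return enriched


# Hypothetical hourly readings (illustrative values only).
rows = [
    {"date": date(2017, 3, 1), "hour": 9, "no2": 31.0},
    {"date": date(2017, 3, 1), "hour": 17, "no2": 48.0},
    {"date": date(2017, 3, 2), "hour": 9, "no2": 27.0},
]

enriched = add_previous_day_max(rows)
for row in enriched:
    print(row["date"], row["hour"], row["prev_day_max"])
```

The original rows are left unmodified, which makes it easy to compare models trained with and without the extra column.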
We also explored time series prediction models, which seemed a natural fit for our purpose. We imported several components from the Cortana Intelligence Gallery and customised them to fit our dataset. The resulting models also performed satisfactorily.
When developing our machine learning models, we also developed several experiments (small projects) in addition to using built-in evaluation tools to further test our models. These include bootstrap evaluation and independence evaluation. In bootstrap evaluation, we tested how the model predicted air quality further into the future using its own previous predictions. In independence evaluation, we tested the importance of certain factors by filling in random data in the corresponding columns.
The Bootstrap Evaluation module
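The two evaluation ideas described above can be sketched in a few lines. This is a simplified illustration, not the Azure ML experiments themselves: the two toy models, the column names (`wind_speed`, `humidity`) and all values are hypothetical stand-ins chosen so that the behaviour is easy to follow.

```python
import random


def recursive_forecast(model, history, steps):
    """Predict `steps` values ahead, feeding each prediction back in as input."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        prediction = model(window)
        forecasts.append(prediction)
        window = window[1:] + [prediction]  # slide the window forward
    return forecasts


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def randomised_column_error(model, rows, targets, column, rng):
    """Error after replacing one input column with random values from its range."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    scrambled = [{**row, column: rng.uniform(low, high)} for row in rows]
    return mean_absolute_error(targets, [model(row) for row in scrambled])


# Bootstrap-style evaluation with a toy model: the mean of the last two values.
def window_model(window):
    return sum(window[-2:]) / 2


forecasts = recursive_forecast(window_model, [10.0, 20.0], steps=3)


# Independence-style evaluation with a toy model that ignores humidity.
def row_model(row):
    return 2.0 * row["wind_speed"]


rows = [
    {"wind_speed": 1.0, "humidity": 60.0},
    {"wind_speed": 3.0, "humidity": 75.0},
    {"wind_speed": 5.0, "humidity": 80.0},
    {"wind_speed": 7.0, "humidity": 90.0},
]
targets = [row_model(row) for row in rows]
rng = random.Random(0)  # fixed seed so the run is repeatable

humidity_error = randomised_column_error(row_model, rows, targets, "humidity", rng)
wind_error = randomised_column_error(row_model, rows, targets, "wind_speed", rng)
print(forecasts, humidity_error, wind_error)
```

In the first part, comparing the forecasts with real observations at each step shows how quickly errors compound as the model relies on its own output. In the second, a column whose randomisation barely changes the error (humidity here) contributes little to the prediction, while a large increase (wind speed here) marks an important factor.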
The interactive map shows how the air quality varies across Cambridge, with the reddest areas having the worst air quality
The web front-end we created makes air quality information very easy to read, as well as being interactive and intuitive: the map can be dragged around and zoomed to check the air pollution in various areas of Cambridge, and the slider at the bottom allows easy exploration of the air quality at different times. Using Angular allowed us to develop in a very modular way, which made team development much more productive and made tests easy to write.
The Air Quality Radar website, showing the map, time slider, and detail sidebar
To find out more about the project and try it for yourself, visit our website. Furthermore, to make our results truly open and accessible, we have decided to open-source the whole project. You can check out and contribute to the front-end, back-end, and machine learning scripts on GitHub. We hope that Air Quality Radar encourages further efforts to make open data truly open and accessible to all.
Project Final Presentation