Spatial Data Analysis with R

07/09/2017

Guest blog by Jason Zhang, Microsoft Student Partner at the University of Cambridge

Hi, I am Jason, a third-year Natural Sciences student at the University of Cambridge. With a great interest in Microsoft and its work in Data Science, I joined the Microsoft Student Partner (MSP) program in 2016.

I then took the Data Science track of the Microsoft Professional Program, which covered Transact-SQL, Power BI, Statistics, Machine Learning and Applied Machine Learning. The program gave me a solid grounding for pursuing a career in Data Science, and it is offered free to MSPs.

What is spatial data?

Spatial data is one of the most common data types in everyday life. It refers to any data object or element that has a location in a geographical space, and it makes it possible to find and locate individuals or devices anywhere in the world. Spatial data is also known as geospatial data, spatial information or geographic information.

Spatial data is used in geographical information systems (GIS) and other geolocation or positioning services. Spatial data consists of points, lines, polygons and other geographic and geometric data primitives, which can be mapped by location, stored with an object as metadata or used by a communication system to locate end user devices. Spatial data may be classified as scalar or vector data. Each provides distinct information pertaining to geographical or spatial locations.

How is spatial data represented?

Different programming languages have different representations but they share similar concepts.

For example, in R:

· Points can be represented by a ‘SpatialPoints’ object, which is a set of geographical points given by their coordinates (for example longitudes and latitudes). The object also stores the bounding box, namely the minimum and maximum coordinate values of the points it contains.

· Polygons can be represented by a ‘SpatialPolygons’ object, or by a ‘SpatialPolygonsDataFrame’ when a data frame of attributes is attached to the polygons. For instance, the city of London can be represented by a SpatialPolygonsDataFrame which contains its population, famous tourist attractions, tube maps and so on.

· Grids can be represented by a ‘SpatialPixels’ object, which contains the bounding box coordinates and the properties of the spatial grid. Because it is a grid structure, it usually holds many more points than a spatial polygon.

With the help of these spatial classes, spatial data can be easily represented and further manipulated.
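The three representations above can be sketched with the sp package, which defines these classes. This is a minimal illustration rather than part of the original analysis: the coordinates, the polygon and the population value are made up, and no coordinate reference system is set.

```r
library(sp)

# Three illustrative points (longitude, latitude); the values are made up.
xy <- cbind(lon = c(-0.13, -0.10, -0.08), lat = c(51.50, 51.52, 51.51))
pts <- SpatialPoints(xy)
bbox(pts)  # minimum and maximum of each coordinate

# One square polygon with a (hypothetical) attribute table attached.
ring <- Polygon(cbind(c(0, 1, 1, 0, 0), c(0, 0, 1, 1, 0)))
polys <- SpatialPolygons(list(Polygons(list(ring), ID = "a")))
attrs <- data.frame(population = 1000, row.names = "a")
polys_df <- SpatialPolygonsDataFrame(polys, attrs)

# A regular 3 x 3 grid of cell centres stored as SpatialPixels.
cells <- expand.grid(x = 1:3, y = 1:3)
pix <- SpatialPixels(SpatialPoints(cells))
```

The row name of the attribute table ("a") has to match the polygon ID, which is how sp links each polygon to its row of attributes.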

Figure 1. The newhaven dataset in the GISTools package contains spatial data; for example, breach is a SpatialPoints object.

Figure 2. blocks in the newhaven dataset is a SpatialPolygonsDataFrame.

Furthermore, if the data is stored in a SQL Server database, it is also possible to execute R (or other) code from within Transact-SQL by calling the sp_execute_external_script stored procedure and specifying the language, the script and the input query. With this approach, the data can be used directly in the R code.

In SQL Server, there is also a data type for spatial data called ‘geography’, which can be used to represent spatial points, lines and polygons.

Visualisation of spatial data

Graphs are often more informative than text or tables, so visualisation is an essential part of spatial data analysis.

Firstly, to get a general idea of the spatial dataset, R provides many convenient tools:

· The plot function can be called to show the various components of the dataset, such as points, lines and polygons.

· The bubble function (in the sp package) can be used to show how a variable is distributed across the points, with the size of each bubble proportional to its value.

· The choropleth function (in the GISTools package) can be used to create a choropleth, a colour-shaded map showing the block-by-block distribution of a certain property on a colour scale.

Figure 3. Visualisation of the dataset with simple plot functions.
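As a rough sketch of these first-look plots, the example below uses the meuse sample data that ships with sp instead of the newhaven data shown in the figures. The choropleth function belongs to GISTools, so spplot is used here as a stand-in for a colour-coded view; that substitution is an assumption of this sketch, not the method behind Figure 3.

```r
library(sp)

data(meuse)                    # sample data shipped with sp
coordinates(meuse) <- ~ x + y  # promote the data frame to a SpatialPointsDataFrame

# Raw point locations.
plot(meuse, pch = 16, cex = 0.6)

# Bubble size proportional to the zinc measurement at each point.
print(bubble(meuse, "zinc", main = "Zinc concentration"))

# Colour-coded view of the same variable.
print(spplot(meuse, "zinc"))
```

bubble and spplot return lattice objects, so they need an explicit print when run from a script rather than at the console.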

Secondly, it is useful to explore how nearby spatial points are correlated. A variogram serves this purpose: it is a function describing the degree of spatial dependence of a spatial random field or stochastic process.

As a concrete example from the field of gold mining, a variogram gives a measure of how much two samples taken from the mining area will vary in gold percentage depending on the distance between them. Samples taken far apart tend to vary more than samples taken close to each other.

To generate a variogram for a data set x in R, first calculate the sample variogram with the variogram function from the gstat package, and then fit a variogram model to that sample variogram with the fit.variogram function.
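The two steps can be sketched with gstat on the meuse sample data from sp. The choice of variable (log zinc), the spherical model and the starting values are assumptions made for illustration, not settings from the original analysis.

```r
library(sp)
library(gstat)

data(meuse)
coordinates(meuse) <- ~ x + y

# Step 1: sample (empirical) variogram of log zinc concentration.
v <- variogram(log(zinc) ~ 1, meuse)

# Step 2: fit a spherical model to the sample variogram.
# The starting values (partial sill 1, range 900, nugget 1) are rough guesses.
v_fit <- fit.variogram(v, model = vgm(psill = 1, model = "Sph", range = 900, nugget = 1))

print(v_fit)
print(plot(v, v_fit))  # sample variogram with the fitted curve overlaid
```

The fitted model has two components, a nugget and a spherical part, whose sills and range are reported by print(v_fit).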

Thirdly, to get a more interactive and detailed representation, Power BI can be used:

· Import data from the source of choice, e.g. a SQL Server database on Azure, and load it into the data model.

· The data will be shown on a map automatically (if not, choose a map as the visual), and the size of each point can be set proportional to a measured quantity, e.g. the pollution density at that location.

· Many other tools and fields can be used for further analysis, e.g. filters for a neater and more concise representation, and tooltips for extra fields.

Predictive models

Spatial data can be used to build various predictive models, for example kernel density estimation (KDE), K-nearest neighbours (KNN), and kriging or Gaussian process regression.

The detailed theories are well documented online, but the general ideas are:

· KDE fits a smooth probability distribution to the data according to the density of the observed points.

· KNN uses the K nearest neighbours of a point x to vote on what the label of x should be; the votes can be weighted by distance.

· Kriging predicts the spatial distribution y(x) by minimising the variance of the prediction error, where the prediction error is the difference between the true y(x) and the predicted y(x). Equivalently, Gaussian process regression arrives at the same goal through a maximum likelihood interpretation.
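To make the KNN idea from the list above concrete, here is a small distance-weighted classifier in base R. The function name knn_predict, the toy coordinates and the labels are all hypothetical, and inverse-distance weighting is just one possible voting scheme.

```r
# Distance-weighted K-nearest-neighbour vote for a single query point.
knn_predict <- function(train_xy, train_labels, query_xy, k = 3) {
  # Euclidean distance from the query point to every training point.
  query <- matrix(query_xy, nrow(train_xy), ncol(train_xy), byrow = TRUE)
  d <- sqrt(rowSums((train_xy - query)^2))
  nearest <- order(d)[seq_len(k)]
  # Inverse-distance weights; the small constant avoids division by zero.
  w <- 1 / (d[nearest] + 1e-9)
  # Sum the weights per label and return the label with the largest total.
  votes <- tapply(w, train_labels[nearest], sum)
  names(votes)[which.max(votes)]
}

# Toy training data: two well-separated groups of points.
train_xy <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_labels <- c("low", "low", "low", "high", "high", "high")

knn_predict(train_xy, train_labels, c(0.5, 0.5))  # query near the first group
knn_predict(train_xy, train_labels, c(5.2, 5.1))  # query near the second group
```

With k = 3 each query only consults points from its own group, so the first call returns "low" and the second "high".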

KDE can be implemented in R with the GISTools package:

· The kde.points function can be called on a ‘SpatialPoints’ object to generate a density estimate.

· The density distribution (which is also the probability distribution) can then be visualised with various plots, such as level plots (level.plot).
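The sketch below shows the same two steps, estimating and then plotting a density, but it swaps in kde2d from the MASS package (shipped with R) so that it runs without GISTools. The point locations are simulated, so both the data and the substitution are assumptions of this example.

```r
library(MASS)  # recommended package shipped with R

set.seed(1)
# Hypothetical event locations: two simulated clusters of points.
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))
y <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))

# Two-dimensional Gaussian kernel density estimate on a 50 x 50 grid.
dens <- kde2d(x, y, n = 50)

# Filled contour (level) plot of the estimated density.
filled.contour(dens$x, dens$y, dens$z, xlab = "x", ylab = "y")
```

kde2d returns the grid coordinates in dens$x and dens$y and the estimated density values in the matrix dens$z.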

Kriging is implemented in R by the krige function in the gstat package:

· First generate a fitted variogram model of the spatial data set, as described in the visualisation section above.

· Then apply kriging by calling the krige function with the data set, the prediction locations and the fitted variogram model.

· Finally, visualise the result with the spatial plot function spplot, or with other plotting techniques.
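The three steps above can be sketched with gstat and the meuse sample data from sp. As before, the variable (log zinc), the spherical model and its starting values are illustrative assumptions; meuse.grid supplies the prediction locations.

```r
library(sp)
library(gstat)

# Observations and prediction grid, both shipped with sp.
data(meuse)
coordinates(meuse) <- ~ x + y
data(meuse.grid)
gridded(meuse.grid) <- ~ x + y  # becomes a SpatialPixelsDataFrame

# Step 1: fitted variogram model.
v <- variogram(log(zinc) ~ 1, meuse)
v_fit <- fit.variogram(v, model = vgm(psill = 1, model = "Sph", range = 900, nugget = 1))

# Step 2: ordinary kriging of log zinc onto the grid.
kr <- krige(log(zinc) ~ 1, meuse, meuse.grid, model = v_fit)

# Step 3: maps of the prediction and of the kriging variance.
print(spplot(kr["var1.pred"], main = "Ordinary kriging prediction"))
print(spplot(kr["var1.var"], main = "Kriging variance"))
```

The result holds two columns per grid cell: var1.pred is the predicted value and var1.var is the kriging variance, which grows with distance from the observations.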

Introduction to Microsoft R Server

Overview

R Server is an enterprise class server for hosting and managing parallel and distributed workloads of R processes on servers (Linux and Windows) and clusters (Hadoop and Apache Spark).

It provides an execution engine for solutions built using Microsoft R packages, extending open source R with support for high-performance analytics, statistical analysis, machine learning scenarios, and massively large datasets. Value-added functionality is provided through proprietary packages that install with the server.

Why R Server

Data scientists who start with R Client or open source R typically move to R Server because the data size or computational scale requires additional capacity. R Server provides the infrastructure for distributing a workload across multiple nodes (referred to as data chunking), running jobs in parallel, and then reassembling the results for further analysis and visualization.

In addition to capacity and scale, R Server offers machine learning features and allows you to operationalize your analytics. You can use Microsoft R Server as the deployment engine for your advanced R analytics. Regardless of the source, language or method, you can simplify, deploy, and realize the promise and power of advanced analytics.

How to use R Server

R Server runs as a background process that starts up when you launch the Rgui or an R IDE such as R Tools for Visual Studio (RTVS), RStudio, or other applications. Generally, you can use any R IDE that can consume R packages.

Data scientists who use R Server typically connect over Remote Desktop, and then use RTVS or another IDE to create or run solutions interactively. Solutions are usually script files that combine open source R functions with functions from proprietary packages: RevoScaleR, MicrosoftML, mrsdeploy, RevoPemaR, and so forth.

Microsoft R Server is available free of charge through Microsoft Imagine.

Machine Learning

In R Server, you can use the MicrosoftML package, which provides state-of-the-art, fast, scalable machine learning algorithms and transforms. These functions enable you to tackle common machine learning and data science tasks such as text and image featurization, classification, anomaly detection, regression and ranking. The goal is to help developers, data scientists, and an increasing spectrum of information workers to design and implement intelligent products, services and devices.

The MicrosoftML documentation discusses these tasks and lists the key R functions the package provides for transforming and modelling data. You can also install pre-trained cognitive models for sentiment analysis and image featurization by selecting them in R Server Setup.

Summary

Spatial data plays an essential role in many areas, and by adopting appropriate analysis tools and applying suitable methodologies, a great amount of information can be extracted from it. This information can then be used to build predictive models or to support further analysis, and Microsoft provides toolkits for the whole process:

· Data Collection with Azure IoT Suite

· Data Storage on Azure SQL Server

· Data Visualisation with Power BI or Excel

· Data Manipulation with Microsoft R Server

· Predictive Modelling with Azure ML