Sanjay Mishra, Arvind Shaymsundar
Reviewed By: Joe Sack
SQL Server 2016 has several new features with SQL Server R Services being one of the most interesting ones. This feature brings data science closer to where most data lives – in the database! It also opens up a world of extensibility to pure database developers by allowing them to write powerful scripts in the R language to complement the T-SQL programming surface area already available to them. In this post, we show you a great example of how you can leverage this awesome feature.
The Shortest Path Problem
Take for example, the classic problem of finding the shortest path between 2 locations. Specifically, let’s say you are a pilot or a flight planner trying to construct a flight plan between 2 airports. And let’s say that all the data about the locations of these airports, and the ‘airways’ (the well-defined paths to follow in the sky) are all stored in the database. Now, how do you find out the shortest path between these two airports? Here are two approaches to do so: the classic T-SQL way, and then the R Services way.
But first, let’s look at the data we have. To make this realistic, we imported data from the FAA’s 56-day NASR navigation data product into SQL Server 2016. After importing and some post-processing, we end up with 2 simple tables:
- The Node table has details of airports and predefined navigational points. Each such ‘Node’ is identified by the ‘Name’ column. For example, Seattle-Tacoma airport is identified by the name ‘KSEA’ which is the international aviation standard name for this airport. A numeric Id column is used as a key column.
- The Edge table contains the well-known paths (in aviation parlance ‘airways’ between airports and navigational points) between the Nodes defined above. Each such path has a ‘Weight’ column which is actually the distance in meters between the 2 nodes for that path. This Weight column is therefore very important because when you are computing shortest paths, we want to minimize the distance.
Using T-SQL to find shortest paths
In such a case, if a developer were to implement Dijkstra’s algorithm to compute the shortest path within the database using T-SQL, then they could use approaches like the one at Hans Oslov’s blog. Hans offers a clever implementation using recursive CTEs, which functionally does the job well. This is a fairly complex problem for the T-SQL language, and Hans’ implementation does a great job of modelling a graph data structure in T-SQL. However, given that T-SQL is mostly a transaction and query processing language, this implementation isn’t very performant, as you can see below.
-- The T-SQL way (from http://www.hansolav.net/sql/graphs.html) -- The below query executes Hans Oslav’s implementation of Dijkstra’s algorithm on with the Node ID values corresponding to Seattle (airport code SEA) and Dallas / Fort-Worth (airport code DFW) respectively. exec usp_Dijkstra 24561, 22699
The execution of the above completes in a little under a minute on a laptop with an i7 CPU. Later in this post you can review the timings for this route and another route from Anchorage, Alaska (airport code ANC) to Miami, Florida (airport code MIA).
Enter SQL Server R Services
In case you have not used SQL Server R Services previously, our previous blog post will be a great starting point. In that post, Joe Sack provides many ‘getting started’ links, and a comprehensive description of real-world customer scenarios where this feature is being used.
Getting Started: setting up R Packages
Firstly, we need to ensure that we have correctly configured and validated the installation of R Services (in-database). Follow the instructions here to make sure R Services is working correctly within SQL Server 2016.
Now, one of the most powerful things with R is the extensibility it allows in the form of packages. Developers can tap into this extensible set of libraries and algorithms to improve certain cases which T-SQL does not handle very well – one example being the above shortest path algorithm.
It turns out that R has a very powerful graph library – igraph – which also offers an implementation of Dijkstra’s algorithm! Let’s see how we can leverage that to achieve our purpose. So we need to follow the steps here to install the igraph and jsonlite packages. Exactly how these packages help us in solving this shortest path problem is explained in the next section.
Calling the R Script from T-SQL
Take a minute to review the completed script in the next section. That script accomplishes the same task (finding the shortest flight path between Seattle and Dallas Fort-Worth) but by using SQL Server R Services.
Let’s break down what is in the completed script. Note that the R script itself is stored in the T-SQL string variable called @RScript. Let us further break down what that R script is actually doing:
- Notice the use of the R library() function to import the igraph and jsonlite libraries that we previously installed.
- Later, we use the jsonlite library’s fromJSON function to parse the values in the Nodes and Edges variables, which are supplied from the T-SQL side of this script in JSON format. The reason for using this approach is because SQL Server R Services today only supports one input dataset to be supplied to the R script.
- We then use data.frame to use the edges and nodes supplied to construct the graph using the igraph library.
- Then, we use the paths function on the graph, specifying the source and destination nodes.
- Then we compute the total distance travelled on this shortest path and stores it in the TotalDistance
- We then compute the actual path in the form of Node IDs (stores that in the PathIds variable) and also in the form of human-readable navigation point identifers (stored into the PathNames variable).
- The final part of the R Script uses frame to build what will eventually be returned as a T-SQL result set with 5 columns, so that it is equivalent to what Hans Oslav’s stored procedure was returning.
Now that you have understood the R portion of the script, let’s look at how the R script is invoked from the main T-SQL body.
- We use the sp_execute_external_script system stored procedure to invoke the R script that we just declared a bit earlier.
- As mentioned earlier, the current version of sp_execute_external_script only allows one input dataset. So we have to pass in the Nodes and Edges required for Dijkstra’s algorithm as parameters. The new FOR JSON clause in T-SQL allows us to pass this data in is an efficient way.
- The call to sp_execute_external_script also shows how variables from the R script are mapped to T-SQL variables:
|R Variable||T-SQL output parameter|
4. The last part of the T-SQL script simply converts the distance returned by the script (which is in meters) to miles and the aviation unit of nautical miles.
Here is the complete script for ready reference:
-- Dijkstra’s algorithm using R (runs in a few seconds) declare @SourceIdent nvarchar(255) = 'KSEA' declare @DestIdent nvarchar(255) = 'KDFW' declare @sourceId int = (select Id from Node where Name = @SourceIdent) declare @destId int = (select Id from Node where Name = @DestIdent) DECLARE @RScript nvarchar(max) SET @RScript = CONCAT(N' library(igraph) library(jsonlite) mynodes <- fromJSON(Nodes) myedges <- fromJSON(Edges) destNodeId <- ', @destId,' destNodeName <- subset(mynodes, Id == destNodeId) g <- graph.data.frame(myedges, vertices=mynodes, dir = FALSE) (tmp2 = get.shortest.paths(g, from=''', @sourceId, ''', to=''',@destId , ''', output = "both", weights = E(g)$Weight)) TotalDistance <- sum(E(g)$Weight[tmp2$epath[]]) PathIds <- paste(as.character(tmp2$vpath[]$name), sep="''", collapse=",") PathNames <- paste(as.character(tmp2$vpath[]$Name), sep="''", collapse=",") OutputDataSet <- data.frame(Id = destNodeId, Name = destNodeName$Name, Distance = TotalDistance, Path = PathIds, NamePath = PathNames) ') DECLARE @NodesInput VARCHAR(MAX) = (SELECT * FROM dbo.Node FOR JSON AUTO); DECLARE @EdgesInput VARCHAR(MAX) = (SELECT * FROM dbo.Edge FOR JSON AUTO); declare @distOut float DECLARE @PathIdsOut VARCHAR(MAX) DECLARE @PathNamesOut VARCHAR(MAX) EXECUTE sp_execute_external_script @language = N'R', @script = @RScript, @input_data_1 = N'SELECT 1', @params = N'@Nodes varchar(max), @Edges varchar(max), @TotalDistance float OUTPUT, @PathIds varchar(max) OUTPUT, @PathNames varchar(max) OUTPUT', @Nodes = @NodesInput, @Edges = @EdgesInput, @TotalDistance = @distOut OUTPUT, @PathIds = @PathIdsOut OUTPUT, @PathNames = @PathNamesOut OUTPUT WITH RESULT SETS (( Id int, Name varchar(500), Distance float, [Path] varchar(max) , NamePath varchar(max))) -- here we format the result in different units of distance - miles and nautical miles SELECT @distOut * 0.00062137 AS DistanceInMiles, @distOut * 0.00053996 AS DistanceInNauticalMiles
This script is much quicker and produces similar output to the T-SQL implementation. Figure 2 compares the execution times using the two implementations:
R Services extends the programming surface area that a Data Engineer has. R Services offers capabilities which nicely complement what T-SQL classically offers. There are some things which R does very well (such as computing shortest paths efficiently) which T-SQL does not do all that well. On the other hand, T-SQL will still excel in tasks for which the database engine is optimized for (such as aggregation). These two are here to play together and bring Data Science closer to where the Data is!