How we did it: PASS 2017 Summit Session Similarity using SQL Graph and Python


I had previously shared a sneak preview of our upcoming session on Graph data processing in SQL Server. The talk is at the PASS Summit 2017. In that post, I had promised to share more details closer to the session. And here it is!

Inferring Graph Edges using SQL ML Services

In many cases, the edges in a graph are deterministic and ‘known’ to the application. In other cases, edges have to be ‘inferred’ or ‘discovered’ by some code:

  • In some cases, node attributes can be used to detect similar nodes and create an edge
  • In other cases, an ETL process could use fuzzy lookups etc.
  • But, for more complex situations, ML Services in SQL Server 2017 and Azure SQL DB can be used as well! sp_execute_external_script can be used to invoke an R / Python script and get back a list of keys to build edges

In this walkthrough we will use ML Services in SQL Server 2017 to invoke a Python script to infer similar edges in a graph.

Approach

The nodes in this graph will be the sessions at PASS 2017 (with the data imported as per this previous post) and then we will use Python to invoke some language processing code to compute the measures of similarity between pairs of sessions, based on their Title and Abstract fields. In summary here is what we will do:

  • Our database has a node table with all the sessions from PASS Summit 2017
  • Sessions are saved as a Node table in SQL Graph
  • Session node has attributes like Session Id, Title, Abstract, Speaker Names and Track
  • Hypothesis: similar themed sessions have similar keywords in their Title / Abstract
  • Using NLP libraries in Python we can break down these sessions into underlying keywords and their frequency counts
  • Construct a “similarity matrix” and then return for each session, those sessions which have at least 20% similarity
  • Construct edges in SQL Graph for these related session pairs

Pre-requisites

We will be leveraging two powerful Python libraries: NLTK and Gensim, to help us analyze the text and derive a measure of similarity for pairs of sessions. While NLTK comes pre-installed with SQL Server 2017 ML Services, you have to install Gensim using PIP:

pip install stop_words
pip install gensim

We will then need to install a "corpus" of stop words for NLTK. This will help eliminate some common "noise" words from text to help improve the accuracy of the analysis. To do this we first create a folder for NLTK data:

md "C:\Program Files\Microsoft SQL Server\MSSQL14.SQL20171000\PYTHON_SERVICES\Lib\nltk_data"

Then we use nltk.download() to download and install the stopwords corpus as shown below. The important thing to note is to correctly escape the backslash characters in the path when providing it to the NLTK download GUI. In my case I used:

C:\\Program Files\\Microsoft SQL Server\\MSSQL14.MSSQLSERVER\\PYTHON_SERVICES\\lib\\nltk_data

Here's a screenshot of the above step in case you are wondering:

Once the stopwords corpus is downloaded, we proceed to create the necessary SQL node table, and then convert the previously prepared "regular" table into a Node table using INSERT...SELECT:

CREATE TABLE [dbo].[Session](
	[Index] integer IDENTITY(1,1) NOT NULL,
	[SessionID] integer NULL,
	[Abstract] [nvarchar](max) NULL,
	[SessionLevel] [int] NULL,
	[Speaker1] [nvarchar](100) NULL,
	[Speaker2] [nvarchar](100) NULL,
	[Speaker3] [nvarchar](100) NULL,
	[Title] [nvarchar](4000) NULL,
	[Track] [nvarchar](50) NULL
) AS NODE
GO

INSERT INTO Session (SessionID, Abstract, SessionLevel, Speaker1, Speaker2, Speaker3, Title, Track) 
SELECT SessionID, Abstract, SessionLevel, Speaker1, Speaker2, Speaker3, Title, Track FROM dbo.PASS2017Sessions;
GO

We then proceed to create an empty edge table:

CREATE TABLE SimilarSessions
(
SimilarityMetric float
)
AS EDGE

This table is implicitly going to hold the "from" and "to" nodes in the graph and additionally it holds a similarity measure value for that relationship.

Using Python (NLTK and Gensim) to compute session similarity

Now, that we have the tables in place, let's dig in and do the heavy lifting of text processing and analytics. Given below is the entire code which does the processing, but given it is complex, let me give you a high level flow prior to actually presenting the code. Here is what is happening in the below code:

  • The session data (titles, session ID, abstract, track and an incremental index number) are provided to the Python script from a T-SQL query (that query is at the very end of this code block)
  • Then NLTK is used to break down the title and abstract into words (a process called tokenization)
  • We then stem and remove stop words from the tokenized words
  • We then proceed to build a corpus of these words, taking only those words which have occurred at least 3 times
  • Then we proceed to use TF-IDF to prepare a document matrix of these words and their frequencies in various documents
  • Then, Gensim is used to compute "Matrix Similarity" which is basically a matrix of documents and how similar they are to each other.
  • Once the similarity matrix is built up, we then proceed to build the output result set which maps back the SessionId values and their similarity measures
  • In the above step, one interesting thing to note is that in SQL, graphs are directed. So we have to exclude situations where Session1 'is similar to' Session2 AND Session2 'is similar to' Session1.
  • Once this list of unique edges is built up, it is written back into SQL as edges in the SimilarSessions graph (edge) table by using a function called rxDataStep.

A small but important nuance here with rxDataStep and specifically SQL Graph edge tables, is that you need to exactly match the $from_id and $to_id column names with the actual values (including the GUID portions) that are in the edge table. Alternatively, you can avoid using rxDataStep and insert the output of the sp_execute_external_script into a temporary table / table variable and then JOIN back to the node tables to finally insert into the graph edge table. We will look at improving this experience going forward.

Take your time to understand the code! Here we go:

exec sp_execute_external_script @language = N'Python',
@script = N'
####
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.snowball import SnowballStemmer
from gensim import corpora, models, similarities
import gensim
import pandas as pd
from revoscalepy import RxSqlServerData, rx_data_step

# read data back in
pdDocuments = InputDataSet

tokenizer = RegexpTokenizer(r"\w+")
en_stop = get_stop_words("en")
stemmer = SnowballStemmer("english", ignore_stopwords=True)

def iter_documents(pdSeries):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for (idx, docrow) in pdSeries.iterrows():
        concatsessionattributes = list()
        concatsessionattributes.append(docrow.Title.lower())
        concatsessionattributes.append(docrow.Abstract.lower())

        concatsessionattributesstr = " ".join(concatsessionattributes)

        tokens = tokenizer.tokenize(concatsessionattributesstr)
        # Remove stop words from tokens
        stopped_tokens = [i for i in tokens if not i in en_stop]
        final = [stemmer.stem(word) for word in stopped_tokens]

        yield final

class MyCorpus(object):
    def __init__(self, pdSeriesInput):
        self.series = pdSeriesInput
        self.dictionary = gensim.corpora.Dictionary(iter_documents(self.series))
        self.dictionary.filter_extremes(no_below=3)
        self.dictionary.compactify()

    def __iter__(self):
        for tokens in iter_documents(self.series):
            yield self.dictionary.doc2bow(tokens)

corp1 = MyCorpus(pdDocuments)
tfidf = models.TfidfModel(corp1,id2word=corp1.dictionary, normalize=True)

train_corpus_tfidf = tfidf[corp1]
corpora.MmCorpus.serialize("train_ssd_corpus-tfidf.mm",train_corpus_tfidf)
train_corpus_tfidf = corpora.MmCorpus("train_ssd_corpus-tfidf.mm")

index = similarities.MatrixSimilarity(train_corpus_tfidf)

tfidf_sims  = index[train_corpus_tfidf]
# print (tfidf_sims)

similaritylist = []

def similarsessions(inputindex):
    print("Selected session: " + pdDocuments.loc[inputindex].Title)
    print()
    print("Most similar sessions are listed below:")
    print()
    topNmatches = tfidf_sims[inputindex].argsort()[-10:][::-1]
    for matchedsessindex in topNmatches:
        if (inputindex != matchedsessindex and round(tfidf_sims[inputindex][matchedsessindex] * 100, 2) > 20.0):
            rowdict = {}
            rowdict["OriginalSession"] = pdDocuments.loc[inputindex].SessionId
            rowdict["SimilarSession"] = pdDocuments.loc[matchedsessindex].SessionId
            rowdict["SimilarityMetric"] = round(tfidf_sims[inputindex][matchedsessindex] * 100, 2)

            # this graph effectively being a "Undirected Graph" we need to
            # only add a new row if there is no prior edge connecting these 2 sessions
            prioredgeexists = False

            for priorrow in similaritylist:# only add a new row if there is no prior edge connecting these 2 sessions
                # only add a new row if there is no prior edge connecting these 2 sessions
                if (priorrow["SimilarSession"] == rowdict["OriginalSession"] and priorrow["OriginalSession"] == rowdict["SimilarSession"]):
                    prioredgeexists = True

            if (not prioredgeexists):
                similaritylist.append(rowdict)

            print(str(matchedsessindex) + ": " + pdDocuments.loc[matchedsessindex]["Title"] + " ("  + str(round(tfidf_sims[inputindex][matchedsessindex] * 100, 2)) + "% similar)")

for sessid in range(len(pdDocuments)):
    similarsessions(sessid)

print(similaritylist.__len__())

finalresultDF = pd.DataFrame(similaritylist)

# rename the DF columns to suit graph column names
finalresultDF.rename(columns = {"OriginalSession":"$from_id_C19A274BF63B41359AD62328FD4E987D", "SimilarSession":"$to_id_464CF6F8A8A1406B914D18B5010D7CB1"}, inplace = True)

sqlDS=RxSqlServerData(connection_string = "Driver=ODBC Driver 13 for SQL Server;Server=.\\SQL2017;Database=PASS-Demo;trusted_connection=YES"
, table="dbo.SimilarSessions")

rx_data_step(finalresultDF, output_file = sqlDS, append = ["rows"])
', @input_data_1 = N'SELECT CAST((ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) - 1 AS INT) as RowIndex, Abstract, SessionLevel, Speaker1, Speaker2, Speaker3, Title, Track, $node_id AS SessionId FROM Session'

Once the above code is executed, the SimilarSessions table is populated with edges! Then we can query that table using regular T-SQL and the new MATCH predicate in SQL Graph. For example below we look at sessions similar to my colleague Denzil's session:

SELECT TS.SessionId, TS.Title, SimilarityMetric
FROM SimilarSessions SS, [Session] OS, [Session] TS
where MATCH (OS-(SS)->TS)
AND (OS.SessionId = 69503)
UNION ALL
SELECT OS.SessionId, OS.Title, SimilarityMetric
FROM SimilarSessions SS, [Session] OS, [Session] TS
where MATCH (OS-(SS)->TS)
AND (TS.SessionId = 69503)

Here is the output of that query:

Sessions similar to Denzil's session

I'm sure you will agree, looking at the above, these are greatly correlated sessions and would be a great recommendation for anyone already viewing Denzil's session!

Visualization - GraphML

Now, the last part is how to visualize the above graph in some capable tool. SQL does not ship with native visualization for graphs, and the main reason for this is that preferences on the visualization are hugely varied and we do not want to enforce anything specific from our side. Instead, we recommend using standard tools like d3.js, Gephi etc. In my case, I chose to use a very powerful tool called Cytoscape. Now, many of these tools understand a standard format for representing graphs, called GraphML. This format is XML and hence it is easy to use T-SQL to generate GraphML corresponding to our graph! Here's the code to do this:

CREATE OR ALTER PROCEDURE CreateGraphML
AS
BEGIN
    DECLARE @prolog AS NVARCHAR (MAX) = N'<?xml version=''1.0'' encoding=''utf-8''?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key attr.name="weight" attr.type="long" for="edge" id="d3" />
  <key attr.name="SessionId" attr.type="string" for="node" id="d2" />
  <key attr.name="Track" attr.type="string" for="node" id="d1" />
  <key attr.name="Title" attr.type="string" for="node" id="d0" />
  <graph edgedefault="undirected">
';
    DECLARE @epilog AS NVARCHAR (MAX) = N'
  </graph>
</graphml>
';
    DECLARE @nodeXML AS NVARCHAR (MAX) = (SELECT   *
                                          FROM     (SELECT 1 AS Tag,
                                                           0 AS Parent,
                                                           S.SessionId AS [node!1!id],
                                                           NULL AS [data!2!!element],
                                                           NULL AS [data!2!key]
                                                    FROM   dbo.[Session] AS S
                                                    UNION ALL
                                                    SELECT 2 AS Tag,
                                                           1 AS Parent,
                                                           S.SessionId,
                                                           CONCAT(S.Title, CHAR(13), CHAR(10), CONCAT('(', S.Speaker1, IIF (S.Speaker2 IS NULL, '', CONCAT(',', Speaker2)), IIF (S.Speaker3 IS NULL, '', CONCAT(',', Speaker3)), ')'), ' [', ((SELECT COUNT(*)
                                                                                                                                                                                                                                               FROM   SimilarSessions AS SS
                                                                                                                                                                                                                                               WHERE  SS.$FROM_ID = S.$NODE_ID) + (SELECT COUNT(*)
                                                                                                                                                                                                                                                                                   FROM   SimilarSessions AS SS
                                                                                                                                                                                                                                                                                   WHERE  SS.$TO_ID = S.$NODE_ID)), ' connections]'),
                                                           'd0'
                                                    FROM   dbo.[Session] AS S
                                                    UNION ALL
                                                    SELECT 2 AS Tag,
                                                           1 AS Parent,
                                                           S.SessionId,
                                                           S.Track,
                                                           'd1'
                                                    FROM   dbo.[Session] AS S
                                                    UNION ALL
                                                    SELECT 2 AS Tag,
                                                           1 AS Parent,
                                                           S.SessionId,
                                                           CAST (S.SessionId AS NVARCHAR (200)),
                                                           'd2'
                                                    FROM   dbo.[Session] AS S) AS InnerTable
                                          ORDER BY [node!1!id], [data!2!!element]
                                          FOR      XML EXPLICIT);
    DECLARE @edgeXML AS NVARCHAR (MAX);
    WITH   Edges
    AS     (SELECT OS.SessionId AS source,
                   TS.SessionId AS target,
                   CAST (SS.SimilarityMetric AS INT) AS data
            FROM   SimilarSessions AS SS, [Session] AS OS, [Session] AS TS
            WHERE  MATCH(OS-(SS)->TS))
    SELECT @edgeXML = (SELECT   *
                       FROM     (SELECT 1 AS Tag,
                                        0 AS Parent,
                                        source AS [edge!1!source],
                                        target AS [edge!1!target],
                                        NULL AS [data!2!!element],
                                        NULL AS [data!2!key]
                                 FROM   Edges
                                 UNION ALL
                                 SELECT 2 AS Tag,
                                        1 AS Parent,
                                        source,
                                        target,
                                        data,
                                        'd3'
                                 FROM   Edges) AS InnerTable
                       ORDER BY [edge!1!source], [edge!1!target], [data!2!!element]
                       FOR      XML EXPLICIT);
    SELECT CONCAT(@prolog, @nodeXML, @edgeXML, @epilog);
END
GO

EXEC CreateGraphML;
GO

/* Run from CMD prompt:
bcp "EXEC CreateGraphML" queryout PASS2017.xml -T -S.\SQL2017 -dPASS-Demo -C65001 -c
*/

And that's it! When you run the BCP command line from CMD prompt, it will create a PASS2017.xml file, which is internally in the GraphML format. That's easily imported into Cytoscape or other such graph visualization tools. And that is how we created the fun visualization that you saw in the "sneak preview" blog post!

Disclaimer

This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment.  THIS SAMPLE CODE AND ANY RELATED INFORMATION ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE.  We grant You a nonexclusive, royalty-free right to use and modify the Sample Code and to reproduce and distribute the object code form of the Sample Code, provided that You agree: (i) to not use Our name, logo, or trademarks to market Your software product in which the Sample Code is embedded; (ii) to include a valid copyright notice on Your software product in which the Sample Code is embedded; and (iii) to indemnify, hold harmless, and defend Us and Our suppliers from and against any claims or lawsuits, including attorneys’ fees, that arise or result from the use or distribution of the Sample Code. This posting is provided "AS IS" with no warranties, and confers no rights.

Comments (2)

  1. Great blog, Arvind!
    Summaries your talk at PASS Summit up very well.

    In the last sample-code in lines 27 to 32 there seems to be something wrong, though.

    1. Ah – forget the comment about lines 27 to 32 … I did not see, that there is code on the far right of the script, where I had to scroll to.

      So my above comment is only about to congratulate to a great blog post!

Skip to main content