Looping over document similarity details in Semantic Search

Article
10/31/2011

Statistical Semantic Search, which is new in SQL Server 2012 “Denali,” indexes documents and text columns to support queries that go beyond the simple keyword search of LIKE or of Full-Text Search. Semantic Search lets you query for 3 types of information about your documents:

Key phrases. Return the statistically significant phrases in each document.
Similar documents. Return documents or rows that are similar or related, based on the key phrases in each document.
Why documents are similar. Return the key phrases that explain why 2 documents were identified as similar.

Semantic Search provides a rowset function for each of these 3 types of queries. Before we dive into the subject of this post – which describes how to do something that you can’t do with the built-in functions – let’s take a quick look at each of these capabilities.

The examples in this post use a collection of SQL Server white papers published on MSDN and TechNet – mostly in Microsoft Word format – saved into a SQL Server database by using a SQL Server 2012 FileTable . With this approach, I can easily save BLOBs into the database simply by dragging and dropping the files in Windows Explorer. The database table is named WhitePapers, and it has the fixed schema common to all FileTables .

Key phrases

To return the key phrases in a document, call the rowset function semantickeyphrasetable and provide the following:

The name of the indexed table that you want to query.
The indexed column or columns that you want to query.
Optionally, the ID of a single document or row.

In the following example, we use the title of a sample document to get its document ID from the FileTable. We also use a wildcard to query all the columns that are indexed for semantic search.

 DECLARE @Title nvarchar(255)
DECLARE @DocumentID hierarchyid

-- We'll use the published white paper "We Loaded 1TB in 30 Minutes with SSIS."
SET @Title = '1TBin30MwithSSIS.docx'

-- Get the ID of the document from its title.
SELECT @DocumentID = path_locator
FROM WhitePapers
WHERE name = @Title

-- Get the top 5 key phrases in the document.
SELECT TOP (5) keyphrase AS 'Key Phrase', score AS 'Score'
FROM semantickeyphrasetable(WhitePapers, *, @DocumentID)
ORDER BY score DESC
GO

This query returns the following statistically significant key phrases for the selected white paper. (In version 1 of semantic search, only one-word terms are supported.)

 Key Phrase       Score
---------------- -------------
sql              0.6475008
tcp              0.6415842
etl              0.6152592
numa             0.5434772
tpc              0.5183253

(5 row(s) affected)

Why documents are similar

To return the common key phrases that explain why 2 documents are identified as similar, call the rowset function semanticsimilaritydetailstable and provide the following:

The name of the indexed table that you want to query.
The IDs of the source document and of the matched (similar) document.
The indexed columns in the source row and in the matched row whose values you want to compare (often the same column).

In the following example, we ask for the key phrases which explain why the document “ScaleUpDWinSQL2008.docx” was identified in the preceding query as the document that’s most closely related to our source document about SSIS performance. The indexed content of the documents is stored in the file_stream column of our FileTable.

 DECLARE @Title1 nvarchar(255)
DECLARE @Document1ID hierarchyid
DECLARE @Document2ID hierarchyid

-- Source white paper - "We Loaded 1TB in 30 Minutes with SSIS, and So Can You."
SET @Title1 = '1TBin30MwithSSIS.docx'

-- Get the ID of the source document from its title.
SELECT @Document1ID = path_locator
FROM WhitePapers
WHERE name = @Title1

-- Get the ID of the top matching document.
SELECT TOP (1) @Document2ID = matched_document_key
FROM semanticsimilaritytable(WhitePapers, *, @Document1ID)
ORDER BY score DESC

-- Now get the reasons for the match.
SELECT TOP (5) keyphrase AS 'Key Phrase', score as 'Score'
FROM semanticsimilaritydetailstable
(WhitePapers, file_stream, @Document1ID, file_stream, @Document2ID)
ORDER by score DESC
GO

This query identifies the top key phrases that are common to both documents:

 Key Phrase       Score
---------------- -------------
sql              0.5141976
etl              0.3497615
server           0.241514
cpus             0.2336491
cpu              0.2166376

(5 row(s) affected)

For each similar document, tell me why it’s similar

Now we have enough background to get to the subject of this post!

It’s natural to want to ask this question: “For each document that’s similar to my source document, tell me why it’s similar.” We don’t provide a built-in function to support this set-based query. But you can achieve this iterative functionality concisely by using CROSS APPLY.

First, let’s create a table-valued function that returns semantic similarity details for a single pair of documents. This function will serve as the target of the CROSS APPLY.

 CREATE FUNCTION semanticsimilaritydetails_foreach
    (
    @Document1ID hierarchyid,
    @Document2ID hierarchyid,
    @PhraseCount smallint
    )
RETURNS TABLE
AS
RETURN 
(
    SELECT TOP (@PhraseCount) * FROM
       semanticsimilaritydetailstable
       (
       WhitePapers,
       file_stream, @Document1ID,
       file_stream, @Document2ID
       )
       ORDER BY score DESC
)
GO

Now let’s use this new function with CROSS APPLY. @Document1ID contains the path_locator value of the source document that we’ve used in the preceding examples. In the WHERE clause, we make sure that we're not comparing the source document to itself.

 DECLARE @DetailsCount smallint = 3;

SELECT WP.name AS ‘Document’,
              SSD.keyphrase AS ‘Key Phrase’,
              SSD.score AS ‘Score’
FROM WhitePapers as WP
CROSS APPLY semanticsimilaritydetails_foreach
       (@Document1ID, WP.path_locator, @DetailsCount) as SSD
WHERE WP.path_locator <> @Document1ID
GO

Here’s a portion of the results. They show not only each document that’s similar or related to our source document, but also the top 3 key phrases that explain why the 2 documents are similar.

 Document                                                         Key Phrase       score
---------------------------------------------------------------- ---------------- -------------
ApplicationAndMulti-ServerManagement.docx                        sql              0.6475008
ApplicationAndMulti-ServerManagement.docx                        server           0.4185843
ApplicationAndMulti-ServerManagement.docx                        cpu              0.2633764
AuditingSQLServer2008.docx                                       sql              0.6475008
AuditingSQLServer2008.docx                                       server           0.3523027
AuditingSQLServer2008.docx                                       sqlserver        0.2144429
Automation of Data Mining Using Integration Services.docx        sql              0.3925959
Automation of Data Mining Using Integration Services.docx        msdn             0.1967144
Automation of Data Mining Using Integration Services.docx        aspx             0.191946

That’s it! Now you know how to iterate over a set of documents that are similar to a source document, and learn why each matching document is listed. You can probably find additional ways to parameterize or enhance these simple examples.

Be sure to check out the killer demo of FileTable and Semantic Search from SQL Server evangelist Roger Doherty and his team. This example includes a tag cloud to visualize the relative importance of key phrases and to find similar documents at a glance.

Looping over document similarity details in Semantic Search

Key phrases

Similar documents

Why documents are similar

For each similar document, tell me why it’s similar

Additional resources