Statistical Semantic Search, which is new in SQL Server 2012 “Denali,” indexes documents and text columns to support queries that go beyond the simple keyword search of LIKE or of Full-Text Search. Semantic Search lets you query for 3 types of information about your documents:
- Key phrases. Return the statistically significant phrases in each document.
- Similar documents. Return documents or rows that are similar or related, based on the key phrases in each document.
- Why documents are similar. Return the key phrases that explain why 2 documents were identified as similar.
Semantic Search provides a rowset function for each of these 3 types of queries. Before we dive into the subject of this post – which describes how to do something that you can’t do with the built-in functions – let’s take a quick look at each of these capabilities.
The examples in this post use a collection of SQL Server white papers published on MSDN and TechNet – mostly in Microsoft Word format – saved into a SQL Server database by using a SQL Server 2012 FileTable. With this approach, I can easily save BLOBs into the database simply by dragging and dropping the files in Windows Explorer. The database table is named WhitePapers, and it has the fixed schema common to all FileTables.
To return the key phrases in a document, call the rowset function semantickeyphrasetable and provide the following:
- The name of the indexed table that you want to query.
- The indexed column or columns that you want to query.
- Optionally, the ID of a single document or row.
In the following example, we use the title of a sample document to get its document ID from the FileTable. We also use a wildcard to query all the columns that are indexed for semantic search.
DECLARE @Title nvarchar(255) DECLARE @DocumentID hierarchyid -- We'll use the published white paper "We Loaded 1TB in 30 Minutes with SSIS." SET @Title = '1TBin30MwithSSIS.docx' -- Get the ID of the document from its title. SELECT @DocumentID = path_locator FROM WhitePapers WHERE name = @Title -- Get the top 5 key phrases in the document. SELECT TOP (5) keyphrase AS 'Key Phrase', score AS 'Score' FROM semantickeyphrasetable(WhitePapers, *, @DocumentID) ORDER BY score DESC GO
This query returns the following statistically significant key phrases for the selected white paper. (In version 1 of semantic search, only one-word terms are supported.)
Key Phrase Score ---------------- ------------- sql 0.6475008 tcp 0.6415842 etl 0.6152592 numa 0.5434772 tpc 0.5183253 (5 row(s) affected)
To return the documents or rows that are similar to another document or row, call the rowset function semanticsimilaritytable and provide the same values as we provided for the previous function. Here however it’s required to provide a source document for the 3rd argument.
In the following example, we assume the same document title and ID as in the preceding example. We JOIN on the source table to return document names in place of less friendly key values.
SELECT TOP (5) WP.name AS Document, SST.score AS Score FROM semanticsimilaritytable(WhitePapers, *, @DocumentID) AS SST JOIN WhitePapers AS WP ON SST.matched_document_key = WP.path_locator ORDER BY score DESC GO
This query returns the following white papers that are similar to the SSIS white paper:
Document Score -------------------------------- ------------- ScaleUpDWinSQL2008.docx 0.301263 FILESTREAMStorage.docx 0.2778505 WhyNotSybaseASE.docx 0.2710847 HAwithSQL2008.docx 0.265366 HAwithSQL2008R2.docx 0.2646715 (5 row(s) affected)
Why documents are similar
To return the common key phrases that explain why 2 documents are identified as similar, call the rowset function semanticsimilaritydetailstable and provide the following:
- The name of the indexed table that you want to query.
- The IDs of the source document and of the matched (similar) document.
- The indexed columns in the source row and in the matched row whose values you want to compare (often the same column).
In the following example, we ask for the key phrases which explain why the document “ScaleUpDWinSQL2008.docx” was identified in the preceding query as the document that’s most closely related to our source document about SSIS performance. The indexed content of the documents is stored in the file_stream column of our FileTable.
DECLARE @Title1 nvarchar(255) DECLARE @Document1ID hierarchyid DECLARE @Document2ID hierarchyid -- Source white paper - "We Loaded 1TB in 30 Minutes with SSIS, and So Can You." SET @Title1 = '1TBin30MwithSSIS.docx' -- Get the ID of the source document from its title. SELECT @Document1ID = path_locator FROM WhitePapers WHERE name = @Title1 -- Get the ID of the top matching document. SELECT TOP (1) @Document2ID = matched_document_key FROM semanticsimilaritytable(WhitePapers, *, @Document1ID) ORDER BY score DESC -- Now get the reasons for the match. SELECT TOP (5) keyphrase AS 'Key Phrase', score as 'Score' FROM semanticsimilaritydetailstable (WhitePapers, file_stream, @Document1ID, file_stream, @Document2ID) ORDER by score DESC GO
This query identifies the top key phrases that are common to both documents:
Key Phrase Score ---------------- ------------- sql 0.5141976 etl 0.3497615 server 0.241514 cpus 0.2336491 cpu 0.2166376 (5 row(s) affected)
For each similar document, tell me why it’s similar
Now we have enough background to get to the subject of this post!
It’s natural to want to ask this question: “For each document that’s similar to my source document, tell me why it’s similar.” We don’t provide a built-in function to support this set-based query. But you can achieve this iterative functionality concisely by using CROSS APPLY.
First, let’s create a table-valued function that returns semantic similarity details for a single pair of documents. This function will serve as the target of the CROSS APPLY.
CREATE FUNCTION semanticsimilaritydetails_foreach ( @Document1ID hierarchyid, @Document2ID hierarchyid, @PhraseCount smallint ) RETURNS TABLE AS RETURN ( SELECT TOP (@PhraseCount) * FROM semanticsimilaritydetailstable ( WhitePapers, file_stream, @Document1ID, file_stream, @Document2ID ) ORDER BY score DESC ) GO
Now let’s use this new function with CROSS APPLY. @Document1ID contains the path_locator value of the source document that we’ve used in the preceding examples. In the WHERE clause, we make sure that we’re not comparing the source document to itself.
DECLARE @DetailsCount smallint = 3; SELECT WP.name AS ‘Document’, SSD.keyphrase AS ‘Key Phrase’, SSD.score AS ‘Score’ FROM WhitePapers as WP CROSS APPLY semanticsimilaritydetails_foreach (@Document1ID, WP.path_locator, @DetailsCount) as SSD WHERE WP.path_locator <> @Document1ID GO
Here’s a portion of the results. They show not only each document that’s similar or related to our source document, but also the top 3 key phrases that explain why the 2 documents are similar.
Document Key Phrase score ---------------------------------------------------------------- ---------------- ------------- ApplicationAndMulti-ServerManagement.docx sql 0.6475008 ApplicationAndMulti-ServerManagement.docx server 0.4185843 ApplicationAndMulti-ServerManagement.docx cpu 0.2633764 AuditingSQLServer2008.docx sql 0.6475008 AuditingSQLServer2008.docx server 0.3523027 AuditingSQLServer2008.docx sqlserver 0.2144429 Automation of Data Mining Using Integration Services.docx sql 0.3925959 Automation of Data Mining Using Integration Services.docx msdn 0.1967144 Automation of Data Mining Using Integration Services.docx aspx 0.191946
That’s it! Now you know how to iterate over a set of documents that are similar to a source document, and learn why each matching document is listed. You can probably find additional ways to parameterize or enhance these simple examples.
Be sure to check out the killer demo of FileTable and Semantic Search from SQL Server evangelist Roger Doherty and his team. This example includes a tag cloud to visualize the relative importance of key phrases and to find similar documents at a glance.