Regular Expressions in T-SQL, Part II

I've gotten quite a bit of mail from people re: my recent post about regular expression searches in T-SQL.  This was actually covered in my last book, so I've been surprised at how many people were unaware that the tools needed to do this were right under their noses.  (You mean you haven't read every book I've ever written cover-to-cover?!  I'm shocked!  Stunned!  Aghast!)  Today, I'll elaborate a bit more on what I've said before on the subject.

First, it should go without saying that any technique that involves creating a COM object, calling a method on it via IDispatch, then destroying it, will have perf challenges when working with lots of data.  Whether those are acceptable to you or not depends on your needs.  You will simply have to test it.  Your mileage may vary.  It's surprisingly fast given what it's doing, but you shouldn't expect to get the functionality for free in terms of performance.  It's not magic.  I thought this was implicit, but I've gotten enough email about it that I want to point it out explicitly:  creating a COM object, interacting with it via IDispatch, and destroying it with each iteration through a search takes time.  Like any new technique, you should fully test it with your data and in your system(s) before assuming it will meet your needs.  BTW, there are ways to avoid having to create and tear it down with each iteration, but I won't go into that here because there's a way to avoid using a COM object altogether, and that brings me to my second point.

Second, if it's not obvious, SQL Server 2005's SQLCLR functionality was born for this type of problem.  I've gone out of my way not to talk much about SS2K5 thus far in this blog because it's not released yet.  Others are more comfortable with that.  More power to them.  For me, things could change -- perhaps dramatically -- between now and when the product is released, so I probably won't be saying much about Yukon until it's actually out.

That said, you may be wondering how you could write a SQLCLR function that does regular expression searches.  You don't have to.  There's already one included with the betas.  If you have the CTP version from April, you already have a SQLCLR-based regular expression function.  Check out the StringUtilities project in the \Program Files\Microsoft SQL Server\90\Samples\Engine\Programmability\CLR\StringUtilities\CS\StringUtilities folder (or the VB equivalent if you're a VB person).  There's a regex class in RegularExpression.cs.  Once you've built the assembly, you can install and test it via the T-SQL scripts in the \Program Files\Microsoft SQL Server\90\Samples\Engine\Programmability\CLR\StringUtilities\Scripts folder.

Here's a sample query that makes use of it:

SELECT

v.VendorID, v.Name

FROM

Purchasing.Vendor v

WHERE

EXISTS (SELECT * FROM dbo.RegexMatches(v.Name, '(\w*\w*)\s(\w*L\w)'))

This one searches Purchasing.Vendor for vendors with a Name column value that consists of two words, the second containing an L. Based on my testing, this looks to be dramatically faster than the COM approach, so there's a lot of promise here.

Of course, SQLCLR opens up all sorts of possibilities along these lines. Do be careful, though. Remember that you're writing code that runs inside the engine. It if causes problems, your neighbors in the SQL Server ecosystem might not be too neighborly. As someone who's seen more than his share of problems caused by custom code running in process, I try to be careful when doing stuff like this.