Regular Expressions in T-SQL


Given that a recent blog of mine lamented the dangers of using xprocs, I thought it would be a good time to show some of the wonderful stuff you can do thanks to SQL Server extensibility features such as xprocs and COM objects.  It’s not that xprocs or in-process COM objects are inherently evil — it’s just that they can be misused and that they can be challenging for the average developer to code properly.


In today’s blog, I’ll show you how to filter T-SQL queries using Regular Expressions.  Regular Expressions, you’ll recall, allow sophisticated string searching and matching that goes beyond simple wildcards.  T-SQL’s LIKE operator (and PATINDEX() function, which has similar functionality) supports basic wildcards and some simple pattern matching, but has never had anything approaching Regular Expression support.  People who’ve written much T-SQL have no doubt encountered situations where they needed string searches that exceeded the meager capabilities of LIKE.


Lurking on any machine that has Windows Scripting Host installed (virtually all machines these days, although scripting can be disabled), is a powerful Regular Expressions facility, the VBScript.RegExp scripting object.  You can get to it from any COM client that supports the IDispatch interface.  IDispatch, you’ll recall, is COM’s popular late-binding interface – it allows applications to use COM components without knowing anything about them at compile-time.  In T-SQL, we get to IDispatch via the sp_OA stored procedures.  Via a simple UDF, we can access the RegExp object as though it were part of T-SQL:


use pubs


GO


DROP function dbo.fn_regex


GO


CREATE FUNCTION


dbo.fn_regex(@pattern varchar(255), @matchstring varchar(8000))


RETURNS int


AS


BEGIN


       declare @obj int


       declare @res int


       declare @match bit


       set @match=0


 


       exec @res=sp_OACreate ‘VBScript.RegExp’,@obj OUT


       IF (@res <> 0) BEGIN


              RETURN NULL


       END


 


       exec @res=sp_OASetProperty @obj, ‘Pattern’, @pattern


       IF (@res <> 0) BEGIN


              RETURN NULL


       END


 


       exec @res=sp_OASetProperty @obj, ‘IgnoreCase’, 1


       IF (@res <> 0) BEGIN


              RETURN NULL


       END


 


       exec @res=sp_OAMethod @obj, ‘Test’,@match OUT, @matchstring


       IF (@res <> 0) BEGIN


              RETURN NULL


       END


 


       exec @res=sp_OADestroy @obj


       return @match


END


GO


The UDF above does several interesting things. Note the fact that we call the sp_OA procs directly from our function. If you’ve done much UDF coding, you’re probably aware of the fact that you can’t call regular stored procedures from a UDF. Fortunately for us, although the sp_OA procs are prefixed with “sp_”, they’re actually extended procedures, which you can call from a UDF. Equally fortunate is the fact that they aren’t “spec procs”—extended procedures implemented internally by the server. Their entry points are in ODSOLE70.DLL, so they’re callable from a UDF just like any other regular xproc.


The algorithm we use here is embarrassingly simple:  We create the object, set some properties, then call the Test method to see whether we have a match.  As the code below illustrates, once we’ve wrapped our Regular Expression functionality in a UDF, we can use it to filter a query.


 


SELECT au_lname


FROM authors


WHERE dbo.fn_regex(‘G.*’,au_lname)<>0


 


au_lname


Green


Greene


Gringlesby


Ringer


Ringer


Straight


Stringer


 


As you can see, you don’t need SQLCLR or anything beyond SQL Server 2000 to add this powerful functionality to T-SQL.  And, contrary to what I would have guessed, this technique is surprisingly fast – I think most users would find the performance quite acceptable, especially for smaller tables.  You’ll have to try it yourself to see how it works in your environment, but it was not nearly as slow as I expected a technique built around constructing and tearing down a COM object with each search iteration to be.


So, even though xprocs and COM objects can be abused and can certainly cause problems when not coded properly, on balance, SQL Server, coupled with the objects and facilities lying around on most users’ machines, offers some great extensibility and power with minimal effort.


 


This technique first appeared in my last book, The Guru’s Guide to SQL Server Architecture and Internals.  That book has lots of additional details about Regular Expression use from T-SQL, including how to do so using the .NET Framework’s Regex object rather than VBScript.RegExp.  See it for additional Regular Expression search techniques as well as for more info on SQL Server’s SQLOLE facility, the component that makes all of this possible.


Comments (13)

  1. Can you use regular expressions in a WHERE clause?

    Ken Henderson shows us how, in this recent blog…

  2. Clifford Dibble says:

    This code will not run on Yukon.

    Period.

    1) ‘allow updates’ is a NOP since direct catalog updates are no longer allowed.

    2) There is no more system_function_schema. It was never documented anyway.

    It is very unwise to use undoc’ed internal APIs. Don’t expect code like this to port to Yukon.

    cdibble@microsoft.com

    SQL Engine PM

  3. Andriy Khwyshchun says:

    I tried it in SQL 2k an 2005. 2k – succeeded, 2005 – got message:

    Msg 15501, Level 16, State 1, Line 1

    SQL Server blocked access to procedure ‘sys.sp_OACreate’ of component ‘Ole Automation Procedures’ because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of ‘Ole Automation Procedures’ by using sp_configure. For more information about enabling ‘Ole Automation Procedures’, see "Surface Area Configuration" in SQL Server Books Online.

  4. David Keaveny says:

    Great article – this sort of functionality is very useful; a lot of the time, LIKE’s selection of pattern matching just doesn’t do what you need. I’ve already used it to identify records with garbage data in them.

    However, this approach doesn’t scale too happily, as a new COM object is created per row. I found another article that builds on this idea, but reduces the number of object references that need to be created: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=27205

    Working against my development database with 250K, the original query took almost 7 minutes; the revised query took less than 2 minutes, so it’s a worthwhile performance improvement.

  5. In SQL 2005 this can be done easily with the SQL-CLR. I use a regular expression UDF as an example in a newsletter. http://www.novicksoftware.com/coding-in-sql/Vol3/cis-v3-N13-dot-net-clr-in-sql-server.htm

    I’m pretty sure the regular expression version is going to be faster than creating a com object.

  6. This way you don’t have to create COM object multiple times…

    set quoted_identifier on

    go

    set ansi_nulls on

    go

    create function Regex(@rx int, @matchstring varchar(150)) returns int

    as

    begin

    declare @match bit

    — execute "Test(…)" method

          exec sp_OAMethod @rx, ‘Test’,@match out, @matchstring

    — this value is greater than zero upon succesfull match

    return @match

    end

    go

    create proc RegexTest

    as

    begin

    set nocount on

    declare

    @rx int,

    @hr int,

    @pattern varchar(75)

    set @pattern = ‘DAV.*’

    — create Regex object and set properties

    exec @hr=sp_OACreate ‘VBScript.RegExp’, @rx OUT

      exec @hr=sp_OASetProperty @rx, ‘Pattern’, @pattern

         exec @hr=sp_OASetProperty @rx, ‘IgnoreCase’, 1

    — execute regular expression matching pattern using Regex UDF

    select * from tableName where Regex(@rx, columnName)>0

    — alternative select (much more inefficient than Regex)

    — select * from tableName where columnName like ‘DAV%’

    — always call destroy object

    exec @hr=sp_OADestroy @rx

    — test

    — exec RegexTest

    end

    go

    set quoted_identifier on

    go

    set ansi_nulls on

    go

  7. Dan Mabee says:

    we’ve written similar code.  especially important is the bi David Mumladze posted about factoring out the com create/destroy code if performance on medium to large tables is expected.

    something else worth noting is that with this approach indexes are useless.  since SQL doesn’t know about the implementation of the function, it’s hands are pretty tied and it has to hand over all values resulting in a table scan every time (unless some other criteria helps knock down the record count 1st.)

    we generally recommend to our customers that they use features like like as much as possible and only use regex when nothing else will do.

    great blog, btw.  love the bits about xp procs.  it explains a great deal about some behavior we’ve seen.  we’ve got a bunch of xp procs that work really well, but i’d still love to get the chance to convert them to clr flavor.

  8. mike says:

    THANKS KEN!! finally regex on sql server that I was actually able to make work!!!

  9. Anderson Fortaleza says:

    Fantastic !! Thank you so much !

  10. Frankus says:

    I need help in other sentence. I want know how many if five or more letters be repetead in a word.

    You can help me in this??