Working with Regular Expressions

One of the many benefits of using CLR Integration in your database applications is the availability of rich .NET Framework libraries. SQL Server provides you with a set of built-in functions that can be used in your queries but with CLR Integration you can enhance this set to any functionality from the list of supported framework libraries for e.g. System.dll.

One such very useful functionality that this enables is to match patterns using regular expressions. You can use System.Text.RegularExpressions to write a few lines of code that can do pattern matching. The following C# code can be used as a function in your queries (once it is deployed to your database) to perform pattern matching:

using System;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

 

public partial class RegExBase
{
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static bool RegExMatch(string pattern, string matchString)
{
Regex r1 = new Regex(pattern.TrimEnd(null));
return Regex.Match(matchString.TrimEnd(null)).Success;
}
};

You can now use it in your queries.

Following query returns rows from a table Person.Contact that contain valid email addresses:

select ContactID, FirstName, LastName, EmailAddress, Phone
from Person.Contact
where dbo.RegExMatch('[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu|nz|au)', EmailAddress) = 1

Another good use of regular expression function is as a constraint in your table. Suppose you want to insert only those rows that have valid email addresses and phone numbers. You can do this as follows:

create table Contacts (
FirstName nvarchar(30),
LastName nvarchar(30),
EmailAddress nvarchar(30) CHECK (dbo.RegExMatch('[a-zA-Z0-9_\-]+@([a-zA-Z0-9_\-]+\.)+(com|org|edu|nz)', EmailAddress)=1),
USPhoneNo nvarchar(30) CHECK (dbo.RegExMatch('\([1-9][0-9][0-9]\) [0-9][0-9][0-9]\-[0-9][0-9][0-9][0-9]', UsPhoneNo)=1))

Now coming back to the code we wrote in our function. You might notice that we are creating an instance of the regular expression class for each invocation. In the above query, if your table has a million rows, this function would be called a million times and hence creating an instance a million times which could be expensive and hurt performance. SQL Server allows you to use readonly static variables in your code. We can thus change the above code to have the email address regular expression created in a readonly static as follows:

using System;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class RegExStatic
{
private readonly static Regex r1 = new Regex("[a-zA-Z0-9_\\-]+@([a-zA-Z0-9_\\-]+\\.)+(com|org|edu|nz|au)");

[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static bool validateEmailStatic(string matchString)
{
return r1.Match(matchString.TrimEnd(null)).Success;
}
};

This can be used in your query as follows:

select * from Person.Contact where
dbo.validateEmailStatic(EmailAddress) = 1

The above code would perform better for large data since it would cache the instance created during the first invocation and use it in future. However, as you can notice, this requires you to know your pattern at the time of writing your code. Many times this might not be an option or you might not want to write one method for every regular expression that you want to use. Regular expression library has an option that does this kind of caching for you. You can use the RegexOptions.Compiled option to enable this caching. This option would compile the regular expression once and use in future invocations for the same regular expression. However, this caching is currently available only in the static Match method on Regex. Thus, this allows you to have the flexibility of passing arbitrary patterns to your method and still not loose performance.

Following is the code to do this:

using System;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class RegExCompiled
{
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static bool RegExCompiledMatch(string pattern, string matchString)
{
return Regex.Match(matchString.TrimEnd(null), pattern.TrimEnd(null), RegexOptions.Compiled).Success;
}
};

There is another option where you can pre compile your pattern into an assembly and then use it in your methods. This could be useful for complex patterns that take a lot of time to compile. Following code shows you how to compile your expression into an assembly:

using System;
using System.Reflection;
using System.Text.RegularExpressions;

public class RegexAsm {
public static void Main()
{
RegexCompilationInfo rci = new RegexCompilationInfo("[a-zA-Z0-9_\\-]+@([a-zA-Z0-9_\\-]+\\.)+(com|org|edu|nz|au)", RegexOptions.None, "EmailRegex", "", true);
AssemblyName an = new AssemblyName();
an.Name = "EmailRegex";
Regex.CompileToAssembly(new RegexCompilationInfo[] { rci }, an);
}
}

Now you can define a SQL CLR function that uses the assembly created by the above code as follows:

 

using System;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class RegExAssembly
{
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static bool validateEmailCompiled(string matchString)
{
EmailRegex emailRegex = new EmailRegex();
return emailRegex.Match(matchString.TrimEnd(null)).Success;
}
};

Today we saw how a few lines of CLR code can enable a great functionality in your queries and let you do things that were very difficult or impossible in the past. Continue to watch this space for more useful information coming in soon.

- Vineet Rao

Microsoft SQL Server

This posting is provided "AS IS" with no warranties, and confers no rights.