Making regex less painful

07/01/2004

So, regexes are darn powerful. But they’re hard to write correctly, hard to read, and hard to verify. To wit:

Regex repositories around the ‘net, where you can find regexes instead of writing them (just Google for them).

Regex tools – Eric’s Regex Workbench, Codeproject’s Expresso, etc.

Regex bugs – consider name canonicalization (C14N) bugs, many of which lead to security issues.

Anyway, they’re hard.

Cyrus talks about the idea of making it easy to use domain-specific languages embedded in C#. Today we fall back on strings:

Regex email = new Regex("^(?=[a-z])[a-z0-9_.-]*(?=[a-z])[a-z]*@ ");

sqlDataAdapter.SelectCommand.CommandText = "SELECT * FROM customers WHERE zip_code = '98052'";

We lose type safety, compile-time syntax checking, performance, and readability.
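To make the loss of compile-time checking concrete, here is a minimal sketch (the class name and the malformed pattern are made up for illustration). The compiler accepts any string literal, so a pattern with an unbalanced parenthesis is only rejected once the Regex constructor actually runs:

```csharp
using System;
using System.Text.RegularExpressions;

static class StringlyTypedDemo
{
    // Returns true if the pattern parses. The compiler cannot tell us this;
    // we only find out by constructing the Regex at run time.
    public static bool TryParse(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Regex reports pattern syntax errors by throwing.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(TryParse("(abc)"));  // prints True: well-formed
        Console.WriteLine(TryParse("(abc"));   // prints False: unbalanced parenthesis
    }
}
```

Both calls compile without a warning; the difference only shows up when the code executes.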

One approach is to create a mapping from the domain language to the native language. In some sense, that’s what programming is all about – mapping the real world into C#.

I did this with regexes a while back, and thought it might be an interesting thing for the blog:

static class RegexBuilder

{

public static string QUOTE = "\"";

public static string GROUP(string s) { return '(' + s + ')'; }

public static string OR(string left, string right) { return GROUP(left + '|' + right); }

public static string NOT(string s) { return "[^" + s + "]"; }

public static string OPTIONAL(string s) { return GROUP(s) + '?'; }

public static string ONEORMORE(string s) { return GROUP(s) + '+'; }

}
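One thing to keep in mind with the builder above is that it does not escape its arguments: every string is spliced into the pattern as-is, so OR("a.b", "c") treats the dot as a wildcard. A possible addition, sketched here on the assumption that you sometimes want to match literal text, is a LITERAL helper (a hypothetical name, not part of the original class) that defers to Regex.Escape:

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexBuilderExtras
{
    // Hypothetical helper: escape regex metacharacters so s matches literally.
    public static string LITERAL(string s) { return Regex.Escape(s); }

    // Copies of the original GROUP/OR so this sketch compiles on its own.
    public static string GROUP(string s) { return '(' + s + ')'; }
    public static string OR(string left, string right) { return GROUP(left + '|' + right); }

    static void Main()
    {
        string raw = OR("a.b", "c");               // (a.b|c)
        string escaped = OR(LITERAL("a.b"), "c");  // (a\.b|c)

        // Anchor both patterns so the whole input has to match.
        Console.WriteLine(Regex.IsMatch("axb", "^" + raw + "$"));      // prints True: '.' matches 'x'
        Console.WriteLine(Regex.IsMatch("axb", "^" + escaped + "$"));  // prints False: only "a.b" or "c"
    }
}
```

Whether escaping belongs inside the combinators or in a separate helper like this is a design choice; keeping it separate leaves the original methods free to accept sub-patterns.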

I even did this TDD, so here are tests:

[TestFixture]

public class RegexBuilderTests: Assertion

{

void VerifyMatch(string regex, string s)

{

Regex r = new Regex("^" + regex + "$", RegexOptions.IgnorePatternWhitespace);

Match m = r.Match(s);

Assert(m.Success);

Assert(m.Groups.Count >= 1);

Assert(m.Groups[0].Value == s);

}

void VerifyNoMatch(string regex, string s)

{

Regex r = new Regex("^" + regex + "$");

Match m = r.Match(s);

Assert(!m.Success);

}

[Test]

public void QUOTE()

{

AssertEquals(RegexBuilder.QUOTE, @"""");

}

[Test]

public void GROUP()

{

AssertEquals(RegexBuilder.GROUP("a"), "(a)");

}

[Test]

public void OR()

{

AssertEquals(RegexBuilder.OR("a", "b"), @"(a|b)");

}

[Test]

public void NOT()

{

AssertEquals(RegexBuilder.NOT("a"), @"[^a]");

}

[Test]

public void OPTIONAL()

{

AssertEquals(RegexBuilder.OPTIONAL("a"), @"(a)?");

}

[Test]

public void ONEORMORE()

{

AssertEquals(RegexBuilder.ONEORMORE("a"), @"(a)+");

}

}

Here’s an example usage, where I built up the regex for C# Verbatim String Literals:

// these are from the C# spec, which suits us well

public static string SingleVerbatimStringLiteralCharacter = RegexBuilder.NOT(RegexBuilder.QUOTE);

public static string QuoteEscapeSequence = RegexBuilder.QUOTE + RegexBuilder.QUOTE;

public static string VerbatimStringLiteralCharacter = RegexBuilder.OR(SingleVerbatimStringLiteralCharacter, QuoteEscapeSequence);

public static string VerbatimStringLiteralCharacters = RegexBuilder.ONEORMORE(VerbatimStringLiteralCharacter);

public static string quotedStringRegex = RegexBuilder.QUOTE + VerbatimStringLiteralCharacters + RegexBuilder.QUOTE;
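To see what that composition actually produces, here is a small self-contained sketch. It repeats a trimmed copy of the builder so it compiles on its own; the class name VerbatimStringDemo and the methods BuildPattern and Matches are made up for this example. The pattern is anchored with ^ and $ the same way VerifyMatch does it:

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimStringDemo
{
    // Trimmed copy of the builder from the article.
    const string QUOTE = "\"";
    static string GROUP(string s) { return '(' + s + ')'; }
    static string OR(string l, string r) { return GROUP(l + '|' + r); }
    static string NOT(string s) { return "[^" + s + "]"; }
    static string ONEORMORE(string s) { return GROUP(s) + '+'; }

    // Same composition as the fields above, collapsed into one method.
    public static string BuildPattern()
    {
        string single = NOT(QUOTE);        // any character except a quote
        string escape = QUOTE + QUOTE;     // a doubled quote
        string chars = ONEORMORE(OR(single, escape));
        return QUOTE + chars + QUOTE;
    }

    // Anchor the pattern so the whole input must match.
    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, "^" + BuildPattern() + "$");
    }

    static void Main()
    {
        Console.WriteLine(BuildPattern());                 // prints "(([^"]|""))+"
        Console.WriteLine(Matches("\"hello\""));           // prints True
        Console.WriteLine(Matches("\"say \"\"hi\"\"\""));  // prints True: doubled quotes are escapes
        Console.WriteLine(Matches("\"a\"b\""));            // prints False: lone quote inside
        Console.WriteLine(Matches("\"\""));                // prints False: ONEORMORE needs a character
    }
}
```

Note the last case: because the composition uses ONEORMORE, an empty quoted string is not matched, and the leading @ of a verbatim literal is not part of the pattern at all. Accepting empty literals would need a zero-or-more combinator, which the builder as shown does not have.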
