Making regex less painful

So, regexes are darn powerful. But they’re hard to write correctly, hard to read, and hard to verify. To wit:

Regex repositories around the ‘net, where you can find regexes instead of writing them. (google)

Regex tools – Eric’s Regex Workbench, Codeproject’s Expresso, etc.

Regex bugs – consider name canonicalization (C14N) bugs, many of which lead to security issues.

Anyway, they’re hard.

Cyrus talks about the idea of making it easy to use domain-specific languages embedded in C#. Today we fall back to strings:

Regex email = new Regex("^(?=[a-z])[a-z0-9_.-]*(?=[a-z])[a-z]*@ ");

sqlDataAdapter.SelectCommand.CommandText = "SELECT * FROM customers WHERE zip_code = '98052'";

We lose type safety, compile-time syntax checking, performance, and readability.
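To make the "no compile-time syntax checking" point concrete, here is a minimal sketch. The class name StringlyTypedDemo and both patterns are made up for illustration; the only behavior it relies on is that the Regex constructor throws an ArgumentException when it cannot parse a pattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class StringlyTypedDemo
{
    // Hypothetical helper: reports whether a pattern parses.
    // The C# compiler never looks inside the string, so a malformed
    // pattern only surfaces when the constructor actually runs.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsValidPattern("(a|b)+")); // well-formed pattern
        Console.WriteLine(IsValidPattern("(a|b"));   // missing ')': fine as C#, fails at run time
    }
}
```

Both calls compile without complaint; the difference between them only shows up once the program runs.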

One approach is to create a mapping from the domain language to the native language. In some sense, that’s what programming is all about – mapping the real world into C#.

I did this with regexes a while back, and thought it might be an interesting thing for the blog:

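      // Small helpers that build regex fragments as plain strings.
      // Arguments are pasted in as-is; nothing is escaped.
      static class RegexBuilder
      {
            // A literal double-quote character.
            public static string QUOTE = "\"";

            // Wraps s in a capturing group: (s)
            public static string GROUP(string s) { return '(' + s + ')'; }

            // Alternation, grouped so surrounding anchors or quantifiers apply to both sides: (left|right)
            public static string OR(string left, string right) { return GROUP(left + '|' + right); }

            // Negated character class: [^s]
            public static string NOT(string s) { return "[^" + s + "]"; }

            // Zero or one occurrence: (s)?
            public static string OPTIONAL(string s) { return GROUP(s) + '?'; }

            // One or more occurrences: (s)+
            public static string ONEORMORE(string s) { return GROUP(s) + '+'; }
      }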
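One thing to keep in mind with helpers like these is that they paste their arguments straight into the pattern, so any metacharacters in the argument keep their regex meaning. The sketch below shows the effect and one possible way around it. LITERAL and the RegexBuilderExtras class are hypothetical additions, not part of the class above; ONEORMORE is repeated only so the example stands alone. Regex.Escape is the standard .NET call for escaping metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexBuilderExtras
{
    // Hypothetical helper, not in the original RegexBuilder:
    // escapes metacharacters so the text matches itself literally.
    public static string LITERAL(string text) { return Regex.Escape(text); }

    // Same shape as RegexBuilder.ONEORMORE, copied here so the sketch is self-contained.
    public static string ONEORMORE(string s) { return "(" + s + ")+"; }

    // Anchored match against the whole input.
    public static bool MatchesWhole(string regex, string s)
    {
        return Regex.IsMatch(s, "^" + regex + "$");
    }

    static void Main()
    {
        string unescaped = ONEORMORE("a.b");          // (a.b)+   where '.' matches any character
        string escaped = ONEORMORE(LITERAL("a.b"));   // (a\.b)+  where '.' must be a real dot

        Console.WriteLine(MatchesWhole(unescaped, "axb"));   // True
        Console.WriteLine(MatchesWhole(escaped, "axb"));     // False
        Console.WriteLine(MatchesWhole(escaped, "a.ba.b"));  // True
    }
}
```

Regex.Escape is aimed at text outside character classes (it leaves ']' and '-' alone), so a helper like NOT would probably need its own escaping rules.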

I even did this TDD, so here are the tests:

      [TestFixture]
      public class RegexBuilderTests : Assertion
      {
            // Asserts that the anchored pattern matches the whole input string.
            void VerifyMatch(string regex, string s)
            {
                  Regex r = new Regex("^" + regex + "$", RegexOptions.IgnorePatternWhitespace);
                  Match m = r.Match(s);
                  Assert(m.Success);
                  Assert(m.Groups.Count >= 1);
                  Assert(m.Groups[0].Value == s);
            }

            // Asserts that the anchored pattern finds no match in the input string.
            void VerifyNoMatch(string regex, string s)
            {
                  Regex r = new Regex("^" + regex + "$");
                  Match m = r.Match(s);
                  Assert(!m.Success);
            }

            // AssertEquals takes the expected value first, then the actual one.
            [Test]
            public void QUOTE()
            {
                  AssertEquals(@"""", RegexBuilder.QUOTE);
            }

            [Test]
            public void GROUP()
            {
                  AssertEquals("(a)", RegexBuilder.GROUP("a"));
            }

            [Test]
            public void OR()
            {
                  AssertEquals(@"(a|b)", RegexBuilder.OR("a", "b"));
            }

            [Test]
            public void NOT()
            {
                  AssertEquals(@"[^a]", RegexBuilder.NOT("a"));
            }

            [Test]
            public void OPTIONAL()
            {
                  AssertEquals(@"(a)?", RegexBuilder.OPTIONAL("a"));
            }

            [Test]
            public void ONEORMORE()
            {
                  AssertEquals(@"(a)+", RegexBuilder.ONEORMORE("a"));
            }
      }
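The helpers above wrap each pattern in "^" and "$", which is one reason it matters that OR returns a grouped result. The sketch below compares a grouped and an ungrouped alternation under that same wrapping. AnchorDemo and MatchesWhole are names invented for this example, and it uses a plain IsMatch rather than the stricter checks in VerifyMatch.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Mirrors the "^" + regex + "$" wrapping used by the test helpers.
    public static bool MatchesWhole(string regex, string s)
    {
        return new Regex("^" + regex + "$").IsMatch(s);
    }

    static void Main()
    {
        // Grouped: both anchors apply to the whole alternation, so "ab" is rejected.
        Console.WriteLine(MatchesWhole("(a|b)", "ab")); // False

        // Ungrouped: parsed as (^a)|(b$), so "ab" slips through on the first branch.
        Console.WriteLine(MatchesWhole("a|b", "ab"));   // True
    }
}
```

Because every combinator in RegexBuilder groups what it returns, fragments can be concatenated, quantified, or anchored without this kind of precedence surprise.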

Here’s an example usage, where I built up the regex for C# Verbatim String Literals:

            // these names are from the C# spec, which suits us well
            public static string SingleVerbatimStringLiteralCharacter = RegexBuilder.NOT(RegexBuilder.QUOTE);
            public static string QuoteEscapeSequence = RegexBuilder.QUOTE + RegexBuilder.QUOTE;
            public static string VerbatimStringLiteralCharacter = RegexBuilder.OR(SingleVerbatimStringLiteralCharacter, QuoteEscapeSequence);
            public static string VerbatimStringLiteralCharacters = RegexBuilder.ONEORMORE(VerbatimStringLiteralCharacter);
            public static string quotedStringRegex = RegexBuilder.QUOTE + VerbatimStringLiteralCharacters + RegexBuilder.QUOTE;
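For comparison, here is roughly what those definitions expand to when written as a single string, along with a few inputs to try. The expansion was worked out by hand from the helper definitions; VerbatimStringDemo and IsQuotedString are hypothetical names, and the anchors are added only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimStringDemo
{
    // Hand expansion of quotedStringRegex:  "(([^"]|""))+"
    // The doubled parentheses come from OR and ONEORMORE each adding a group.
    public const string Expanded = "\"(([^\"]|\"\"))+\"";

    // Anchored so the whole input has to be exactly one quoted string.
    public static bool IsQuotedString(string s)
    {
        return Regex.IsMatch(s, "^" + Expanded + "$");
    }

    static void Main()
    {
        Console.WriteLine(IsQuotedString("\"hello\""));           // True: plain contents
        Console.WriteLine(IsQuotedString("\"say \"\"hi\"\"\""));  // True: doubled quotes inside
        Console.WriteLine(IsQuotedString("\"a\"b\""));            // False: lone quote in the middle
        Console.WriteLine(IsQuotedString("\"\""));                // False: ONEORMORE needs at least one character
    }
}
```

The last case is worth noting: the C# grammar makes the characters between the quotes optional, so wrapping VerbatimStringLiteralCharacters in RegexBuilder.OPTIONAL would likely be closer to the spec. Either way, the named fragments are a lot easier to check against the grammar than the one-line string.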