Making regex less painful



So, regexes are darn powerful.  But they’re hard to write correctly, hard to read, and hard to verify.   To wit:


 


Regex repositories around the ‘net, where you can find regexes instead of writing them.  (google)


 


Regex tools – Eric’s Regex Workbench, Codeproject’s Expresso, etc.


 


Regex bugs – consider name canonicalization (C14N) bugs, many of which lead to security issues.


 


Anyway, they’re hard.


 


Cyrus talks about the idea of making it easy to use domain-specific languages embedded in C#.  Today we fall back to strings:


 


Regex email = new Regex("^(?=[a-z])[a-z0-9_.-]*(?=[a-z])[a-z]*@ ");


 


sqlDataAdapter.SelectCommand.CommandText = "SELECT * FROM customers WHERE zip_code = '98052'";


 


We lose type safety, compile-time syntax checking, performance, and readability.
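To make the compile-time point concrete, here's a minimal sketch. The `StringPatternDemo` class and its `IsValidPattern` helper are names invented for illustration. As far as the compiler is concerned, a malformed pattern is just a string, so the mistake only shows up when the `Regex` constructor runs:

```csharp
using System;
using System.Text.RegularExpressions;

static class StringPatternDemo
{
    // Hypothetical helper: reports whether a pattern parses.
    // The C# compiler cannot answer this; only the Regex constructor can, at runtime.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // The Regex constructor throws ArgumentException on a pattern parse error.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsValidPattern("(a|b)")); // True
        Console.WriteLine(IsValidPattern("(a|b"));  // False: missing ')', yet it compiles fine
    }
}
```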


 


One approach is to create a mapping from the domain language to the native language.  In some sense, that’s what programming is all about – mapping the real world into C#.


 


I did this with regexes a while back, and thought it might be an interesting thing for the blog:


 



      static class RegexBuilder
      {
            public static string QUOTE = "\"";
            public static string GROUP(string s) { return '(' + s + ')'; }
            public static string OR(string left, string right) { return GROUP(left + '|' + right); }
            public static string NOT(string s) { return "[^" + s + "]"; }
            public static string OPTIONAL(string s) { return GROUP(s) + '?'; }
            public static string ONEORMORE(string s) { return GROUP(s) + '+'; }
      }
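To see how the pieces compose, here's a small sketch. The signed-integer pattern and the `SignedIntegerDemo` class are made up for illustration, and the builder methods are copied in so the snippet stands alone. One thing to watch: the builders concatenate raw text, so any literal metacharacter has to be escaped by the caller. This sketch assumes `Regex.Escape` is an acceptable way to do that:

```csharp
using System;
using System.Text.RegularExpressions;

// Copy of the relevant helpers so the sketch is self-contained.
static class RegexBuilder
{
    public static string GROUP(string s) { return '(' + s + ')'; }
    public static string OR(string left, string right) { return GROUP(left + '|' + right); }
    public static string OPTIONAL(string s) { return GROUP(s) + '?'; }
    public static string ONEORMORE(string s) { return GROUP(s) + '+'; }
}

static class SignedIntegerDemo
{
    // Hypothetical example: an optional sign followed by one or more digits.
    // "+" is a regex metacharacter, so it is escaped before being handed to the builder.
    public static readonly string Sign = RegexBuilder.OR(Regex.Escape("+"), "-");
    public static readonly string Digits = RegexBuilder.ONEORMORE("[0-9]");
    public static readonly string SignedInteger = RegexBuilder.OPTIONAL(Sign) + Digits;

    public static bool IsSignedInteger(string s)
    {
        // Anchor the pattern so the whole input has to match.
        return Regex.IsMatch(s, "^" + SignedInteger + "$");
    }

    static void Main()
    {
        Console.WriteLine(SignedInteger);          // ((\+|-))?([0-9])+
        Console.WriteLine(IsSignedInteger("-42")); // True
        Console.WriteLine(IsSignedInteger("4-2")); // False
    }
}
```

The generated pattern has more parentheses than a hand-written one would, which is the price of every builder grouping its argument defensively.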


 


I even did this test-first (TDD), so here are the tests:



        [TestFixture]
        public class RegexBuilderTests: Assertion
        {
              // Asserts that the anchored pattern matches the entire input string.
              void VerifyMatch(string regex, string s)
              {
                    Regex r = new Regex("^" + regex + "$", RegexOptions.IgnorePatternWhitespace);
                    Match m = r.Match(s);

                    Assert(m.Success);
                    Assert(m.Groups.Count >= 1);
                    Assert(m.Groups[0].Value == s);
              }

              // Asserts that the anchored pattern does not match the input string.
              void VerifyNoMatch(string regex, string s)
              {
                    Regex r = new Regex("^" + regex + "$");
                    Match m = r.Match(s);

                    Assert(!m.Success);
              }

              // AssertEquals takes the expected value first, then the actual value.
              [Test]
              public void QUOTE()
              {
                    AssertEquals(@"""", RegexBuilder.QUOTE);
              }

              [Test]
              public void GROUP()
              {
                    AssertEquals("(a)", RegexBuilder.GROUP("a"));
              }

              [Test]
              public void OR()
              {
                    AssertEquals(@"(a|b)", RegexBuilder.OR("a", "b"));
              }

              [Test]
              public void NOT()
              {
                    AssertEquals(@"[^a]", RegexBuilder.NOT("a"));
              }

              [Test]
              public void OPTIONAL()
              {
                    AssertEquals(@"(a)?", RegexBuilder.OPTIONAL("a"));
              }

              [Test]
              public void ONEORMORE()
              {
                    AssertEquals(@"(a)+", RegexBuilder.ONEORMORE("a"));
              }
        }


 


Here’s an example usage, where I built up the regex for C# Verbatim String Literals:


 



            // these names are from the C# spec, which suits us well
            public static string SingleVerbatimStringLiteralCharacter = RegexBuilder.NOT(RegexBuilder.QUOTE);
            public static string QuoteEscapeSequence = RegexBuilder.QUOTE + RegexBuilder.QUOTE;
            public static string VerbatimStringLiteralCharacter = RegexBuilder.OR(SingleVerbatimStringLiteralCharacter, QuoteEscapeSequence);
            public static string VerbatimStringLiteralCharacters = RegexBuilder.ONEORMORE(VerbatimStringLiteralCharacter);
            public static string quotedStringRegex = RegexBuilder.QUOTE + VerbatimStringLiteralCharacters + RegexBuilder.QUOTE;
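For the curious, here's a sketch of what that builds and how it behaves. The `VerbatimStringDemo` wrapper, the `IsQuotedString` helper, and the sample inputs are illustrative additions. The builder methods and field definitions are copied from above so the snippet runs on its own:

```csharp
using System;
using System.Text.RegularExpressions;

// Copy of the relevant helpers so the sketch is self-contained.
static class RegexBuilder
{
    public static string QUOTE = "\"";
    public static string GROUP(string s) { return '(' + s + ')'; }
    public static string OR(string left, string right) { return GROUP(left + '|' + right); }
    public static string NOT(string s) { return "[^" + s + "]"; }
    public static string ONEORMORE(string s) { return GROUP(s) + '+'; }
}

static class VerbatimStringDemo
{
    public static string SingleVerbatimStringLiteralCharacter = RegexBuilder.NOT(RegexBuilder.QUOTE);
    public static string QuoteEscapeSequence = RegexBuilder.QUOTE + RegexBuilder.QUOTE;
    public static string VerbatimStringLiteralCharacter = RegexBuilder.OR(SingleVerbatimStringLiteralCharacter, QuoteEscapeSequence);
    public static string VerbatimStringLiteralCharacters = RegexBuilder.ONEORMORE(VerbatimStringLiteralCharacter);
    public static string quotedStringRegex = RegexBuilder.QUOTE + VerbatimStringLiteralCharacters + RegexBuilder.QUOTE;

    // Hypothetical helper: anchors the pattern so the whole input must be one quoted string.
    public static bool IsQuotedString(string s)
    {
        return Regex.IsMatch(s, "^" + quotedStringRegex + "$");
    }

    static void Main()
    {
        Console.WriteLine(quotedStringRegex);                        // "(([^"]|""))+"
        Console.WriteLine(IsQuotedString("\"hello\""));              // True
        Console.WriteLine(IsQuotedString("\"say \"\"hi\"\"\""));     // True: doubled quotes are escapes
        Console.WriteLine(IsQuotedString("\"unterminated"));         // False
        Console.WriteLine(IsQuotedString("\"\""));                   // False: ONEORMORE needs at least one character
    }
}
```

Note that, as written, the pattern covers only the quoted part (no leading `@`), and because it uses ONEORMORE it won't match an empty literal.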


 

Comments (7)
  1. Nat says:

    How can you miss goodies from Roy?

    http://regulator.sourceforge.net/

  2. Naginata says:

    Yes, I use regex constantly, and the regulator is the best program I’ve ever seen for manipulating them. It is the omega program, the one program from which all other programs are forged.

    … ok, it’s good, that’s what I’m saying.

  3. Paulo Morgado says:

    I’ve heard about differences between Regex in .NET and ECMAScript.

    Is this true? What are they?

    Paulo Morgado

  4. Paulo,

    .NET regexes are mostly a superset of ECMAScript. There are some minor differences, but it’s possible to turn on an Ecmascript mode.

    I’ll also vote for the regulator. My workbench was unique in doing regex analysis for a while, but I gave that code to Roy a while back.

  5. Morton says:

    I have not tried the regex "helpers", so I do not know what they offer. Anyway, why not extend intellisense into supporting writing regex. Nothing fancy, just display a list of possible tags and when hovering over them, display longer description and example or whatever.

  6. Morton says:

    Forgot to say this. I’d rather prefer if the IDE helped you teach the language instead of changing how it is written. So writing regex would still be like before, now you could just get additional help while at it, and perhaps offer possibility to expand existing regex into more verbal presentation.
