Making regex less painful
So, regexes are darn powerful. But they’re hard to write correctly, hard to read, and hard to verify. To wit:
Regex repositories around the ‘net, where you can find regexes instead of writing them. (google)
Regex tools – Eric’s Regex Workbench, Codeproject’s Expresso, etc.
Regex bugs – consider name canonicalization (C14N) bugs, many of which lead to security issues.
Anyway, they’re hard.
Cyrus talks about the idea of making it easy to use domain-specific languages embedded in C#. Today we revert to strings:
Regex email = new Regex("^(?=[a-z])[a-z0-9_.-]*(?=[a-z])[a-z]*@ ");
sqlDataAdapter.SelectCommand.CommandText = "SELECT * FROM customers WHERE zip_code = '98052'";
We lose type safety, compile-time syntax checking, performance, and readability.
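To make the compile-time point concrete, here is a minimal sketch. The class and method names (StringPatternDemo, IsValidPattern) are made up for illustration. A malformed pattern is a perfectly legal C# string, so the compiler accepts it and the mistake only shows up when the Regex is constructed at run time:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are not from the original post.
static class StringPatternDemo
{
    // Returns true if the pattern parses. The check can only happen at run time.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Malformed patterns surface here, not in the compiler.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsValidPattern("(a|b)+")); // well-formed pattern
        Console.WriteLine(IsValidPattern("(a|b"));   // missing ')', yet it compiles as C#
    }
}
```

The same goes for the SQL string: a typo in it compiles cleanly and fails only when the command executes.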
One approach is to create a mapping from the domain language to the native language. In some sense, that’s what programming is all about – mapping the real world into C#.
I did this with regexes a while back, and thought it might be an interesting thing for the blog:
static class RegexBuilder
{
    public static string QUOTE = "\"";
    public static string GROUP(string s) { return '(' + s + ')'; }
    public static string OR(string left, string right) { return GROUP(left + '|' + right); }
    public static string NOT(string s) { return "[^" + s + "]"; }
    public static string OPTIONAL(string s) { return GROUP(s) + '?'; }
    public static string ONEORMORE(string s) { return GROUP(s) + '+'; }
}
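One thing to watch with a builder like this: the helpers concatenate their arguments as raw pattern text, so any literal text containing metacharacters has to be escaped first. A possible companion helper could look like the sketch below. LITERAL and RegexBuilderExtras are hypothetical names, not part of the class above. Regex.Escape is the standard .NET call, and GROUP and ONEORMORE are copied in only so the sketch compiles on its own:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical companion to RegexBuilder; not part of the original class.
static class RegexBuilderExtras
{
    // Escapes regex metacharacters so the text is matched literally.
    public static string LITERAL(string text) { return Regex.Escape(text); }

    // Copied from RegexBuilder so this sketch is self-contained.
    public static string GROUP(string s) { return '(' + s + ')'; }
    public static string ONEORMORE(string s) { return GROUP(s) + '+'; }

    static void Main()
    {
        // Without LITERAL, the "." would match any character.
        string pattern = "^" + ONEORMORE(LITERAL("a.b")) + "$";
        Console.WriteLine(pattern);
        Console.WriteLine(Regex.IsMatch("a.ba.b", pattern)); // repeated literal "a.b"
        Console.WriteLine(Regex.IsMatch("axb", pattern));    // "x" is not a literal "."
    }
}
```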
I even did this test-first (TDD), so here are the tests:
[TestFixture]
public class RegexBuilderTests: Assertion
{
    // Asserts that the pattern matches all of s.
    void VerifyMatch(string regex, string s)
    {
        Regex r = new Regex("^" + regex + "$", RegexOptions.IgnorePatternWhitespace);
        Match m = r.Match(s);
        Assert(m.Success);
        Assert(m.Groups.Count >= 1);
        Assert(m.Groups[0].Value == s);
    }

    // Asserts that the pattern does not match all of s.
    void VerifyNoMatch(string regex, string s)
    {
        Regex r = new Regex("^" + regex + "$");
        Match m = r.Match(s);
        Assert(!m.Success);
    }

    // AssertEquals takes (expected, actual), so the expected pattern goes first.
    [Test]
    public void QUOTE()
    {
        AssertEquals(@"""", RegexBuilder.QUOTE);
    }

    [Test]
    public void GROUP()
    {
        AssertEquals("(a)", RegexBuilder.GROUP("a"));
    }

    [Test]
    public void OR()
    {
        AssertEquals(@"(a|b)", RegexBuilder.OR("a", "b"));
    }

    [Test]
    public void NOT()
    {
        AssertEquals(@"[^a]", RegexBuilder.NOT("a"));
    }

    [Test]
    public void OPTIONAL()
    {
        AssertEquals(@"(a)?", RegexBuilder.OPTIONAL("a"));
    }

    [Test]
    public void ONEORMORE()
    {
        AssertEquals(@"(a)+", RegexBuilder.ONEORMORE("a"));
    }
}
Here’s an example usage, where I built up the regex for C# Verbatim String Literals:
// these are from the C# spec, which suits us well
public static string SingleVerbatimStringLiteralCharacter = RegexBuilder.NOT(RegexBuilder.QUOTE);
public static string QuoteEscapeSequence = RegexBuilder.QUOTE + RegexBuilder.QUOTE;
public static string VerbatimStringLiteralCharacter = RegexBuilder.OR(SingleVerbatimStringLiteralCharacter, QuoteEscapeSequence);
public static string VerbatimStringLiteralCharacters = RegexBuilder.ONEORMORE(VerbatimStringLiteralCharacter);
public static string quotedStringRegex = RegexBuilder.QUOTE + VerbatimStringLiteralCharacters + RegexBuilder.QUOTE;
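To see what that composition produces, here is a small self-contained sketch. VerbatimStringDemo, BuildPattern and IsQuotedString are names invented for this example, and the builder helpers are copied in so it compiles on its own. It builds the same pattern in one method and tries it against a few inputs, anchoring with ^ and $ the same way the test helpers do:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are not from the original post.
static class VerbatimStringDemo
{
    // Minimal copies of the RegexBuilder helpers the pattern needs.
    const string QUOTE = "\"";
    static string GROUP(string s) { return "(" + s + ")"; }
    static string OR(string left, string right) { return GROUP(left + "|" + right); }
    static string NOT(string s) { return "[^" + s + "]"; }
    static string ONEORMORE(string s) { return GROUP(s) + "+"; }

    // Same composition as the fields above, collapsed into one method.
    public static string BuildPattern()
    {
        string single = NOT(QUOTE);            // any character except a quote
        string escape = QUOTE + QUOTE;         // a doubled quote
        string character = OR(single, escape);
        return QUOTE + ONEORMORE(character) + QUOTE;
    }

    // Anchors the pattern so the whole input must be one quoted string.
    public static bool IsQuotedString(string input)
    {
        return Regex.IsMatch(input, "^" + BuildPattern() + "$");
    }

    static void Main()
    {
        Console.WriteLine(BuildPattern());
        Console.WriteLine(IsQuotedString("\"hello\""));
        Console.WriteLine(IsQuotedString("\"say \"\"hi\"\"\""));
        Console.WriteLine(IsQuotedString("\"unterminated"));
    }
}
```

Note that because the characters are wrapped in ONEORMORE, the pattern as written does not accept the empty literal (two quotes with nothing between them). If that case matters, a zero-or-more helper would be needed.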