What's wrong with this code, part 8 - the answers.

Yesterday's post was a snippet of code that was supposed to validate an email address provided by the caller.

As such, it wasn't actually that bad, and for most email addresses, it does the right thing.  This example is actually better than many email address validators I've found (some don't allow forms like user@subdomain.domain.com, which is patently legal).

But it doesn't really validate RFC 2822 email addresses.  Instead, it validates COMMON email addresses.  It turns out that it's not really possible to validate all the legal 2822 email addresses in a regular expression.

Here's the grammar for valid email addresses from RFC 2822 (ignoring the obsolete definitions).   Btw, this may not be complete - I tried to pull the relevant pieces from RFC2822, but...:

addr-spec       =       local-part "@" domainlocal-part      =       dot-atom / quoted-stringdomain          =       dot-atom / domain-literaldomain-literal  =       [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]dcontent        =       dtext / quoted-pairdtext           =       NO-WS-CTL /     ; Non white space controls                        %d33-90 /       ; The rest of the US-ASCII                        %d94-126        ;  characters not including "[",                                        ;  "]", or "\"atext           =       ALPHA / DIGIT / ; Any character except controls,                        "!" / "#" /     ;  SP, and specials.                        "$" / "%" /     ;  Used for atoms                        "&" / "'" /                        "*" / "+" /                        "-" / "/" /                        "=" / "?" /                        "^" / "_" /                        "`" / "{" /                        "|" / "}" /                        "~"atom            =       [CFWS] 1*atext [CFWS]dot-atom        =       [CFWS] dot-atom-text [CFWS]dot-atom-text   =       1*atext *("." 1*atext)text            =       %d1-9 /         ; Characters excluding CR and LF                        %d11 /                        %d12 /                        %d14-127quoted-pair     =       ("\" text)NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters                        %d11 /          ;  that do not include the                        %d12 /          ;  carriage return, line feed,                        %d14-31 /       ;  and white space characters                        %d127

The key thing to note in this grammar is that the local-part is almost free-form when it comes to the local part.  And there are characters allowed in the local part like !, *, $, etc that are totally legal according to RFC2822 that aren't allowed.

So the validation routine in question won't accept a perfectly legal email address like "Foo!Bar@MyDomain.Org".

Kudos: Uwe Keim was the first person to catch the big issue (although Denny came REALLY close), in the first two comments on the article.

KCS also pointed out an inefficiency in the code - it uses "if (boolean) return true else return false", which is redundant.

Non Bugs: Tad pointed out that the code doesn't check for a null emailAddress.  Anyone who's been reading my blog for long enough would realize that I don't consider issues like that that to be bugs (and, in fact, I consider checking for those cases to be bugs).  The strongest change I'd make towards validating emailAddress is to put a Debug.Assert(emailAddress != null) in the code - it's a bug in the caller if they pass in a null emailAddress.

philoserf pointed out that the code "is good enough", and he's right.  On the other hand, Aaron Robinson pointed out that people in the .info and .museum top level domains are gonna be pretty unhappy.

Adi Oltean pointed out that V2 of the .Net framework contains the System.Net.MailAddress class which contains a built-in validator.  Very cool.