What’s wrong with this code, part 8 – the answers.

Yesterday's post was a snippet of code that was supposed to validate an email address provided by the caller.

As such, it wasn't actually that bad, and for most email addresses, it does the right thing.  This example is actually better than many email address validators I've found (some don't allow forms like user@subdomain.domain.com, which is patently legal).

But it doesn't really validate RFC 2822 email addresses.  Instead, it validates COMMON email addresses.  It turns out that it's not really possible to validate all the legal 2822 email addresses in a regular expression.

Here's the grammar for valid email addresses from RFC 2822 (ignoring the obsolete definitions).   Btw, this may not be complete - I tried to pull the relevant pieces from RFC2822, but...:

addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string
domain          =       dot-atom / domain-literal
domain-literal  =       [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
dcontent        =       dtext / quoted-pair
dtext           =       NO-WS-CTL /     ; Non white space controls
                        %d33-90 /       ; The rest of the US-ASCII
                        %d94-126        ;  characters not including "[",
                                        ;  "]", or "\"
atext           =       ALPHA / DIGIT / ; Any character except controls,
                        "!" / "#" /     ;  SP, and specials.
                        "$" / "%" /     ;  Used for atoms
                        "&" / "'" /
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
atom            =       [CFWS] 1*atext [CFWS]
dot-atom        =       [CFWS] dot-atom-text [CFWS]
dot-atom-text   =       1*atext *("." 1*atext)
text            =       %d1-9 /         ; Characters excluding CR and LF
                        %d11 /
                        %d12 /
quoted-pair     =       ("\" text)
NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters

The key thing to note in this grammar is that the local-part is almost free-form when it comes to the local part.  And there are characters allowed in the local part like !, *, $, etc that are totally legal according to RFC2822 that aren't allowed.

So the validation routine in question won't accept a perfectly legal email address like "Foo!Bar@MyDomain.Org".

Kudos: Uwe Keim was the first person to catch the big issue (although Denny came REALLY close), in the first two comments on the article.

KCS also pointed out an inefficiency in the code - it uses "if (boolean) return true else return false", which is redundant.

Non Bugs: Tad pointed out that the code doesn't check for a null emailAddress.  Anyone who's been reading my blog for long enough would realize that I don't consider issues like that that to be bugs (and, in fact, I consider checking for those cases to be bugs).  The strongest change I'd make towards validating emailAddress is to put a Debug.Assert(emailAddress != null) in the code - it's a bug in the caller if they pass in a null emailAddress.

philoserf pointed out that the code "is good enough", and he's right.  On the other hand, Aaron Robinson pointed out that people in the .info and .museum top level domains are gonna be pretty unhappy.

Adi Oltean pointed out that V2 of the .Net framework contains the System.Net.MailAddress class which contains a built-in validator.  Very cool.

Comments (5)

  1. So does the .NET System.Net.MailAddress validator correctly handle all of these test cases? 🙂

  2. Swamp Justice,

    You know, I don’t know.

    I suspect that it should, but I don’t know for certain.

  3. Eric Newton says:

    I hope the MailAddress validator will have a registry key or machine.config setting for the regex

  4. > it’s a bug in the caller if they pass in a

    > null emailAddress

    But they’re calling an e-mail address validator. They’re using the e-mail address validator for a purpose that looks like its intended purpose. If you want users to know that your routine does something almost like what looks like its intended purpose, you need to have a spec in big red letters that no one can miss when seeing it:

    Returns true if e-mail address appears to be either syntactically valid or null.

    Returns false if e-mail address appears to be either syntactically invalid other than null.

  5. Casey Schmit says:

    "Anyone who’s been reading my blog for long enough would realize that I don’t consider issues like that that to be bugs (and, in fact, I consider checking for those cases to be bugs)."

    I’ve been reading for a while now(keep up the great work), but haven’t gotten the gist about checking params being a bug. This is definitely one of those issues where people seem to differ significantly(always check versus never check). I’m curious to hear your reasoning against checking them. If there is a comment already posted about it, just point me to it.

Skip to main content