EAI (Email Address Internationalization) address validation

I've been asked a few times how to verify that an international email address is well formed.  International email addresses have Unicode in the local part of the address (before the @), in the domain name, or both.

This problem is trickier than many people expect because the actual rules for email addresses can be defined by the email servers themselves.  For example, some servers may treat upper- and lowercase versions of the same word (e.g., Shawn and shawn) as different mailboxes, while other servers may do some mapping and put them in the same mailbox.  And the case mapping may not be the same for all servers: a Turkish server may treat I and i differently than an English server.  With Unicode and its various normalization forms, servers are free to do other normalizations and mappings as well.
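
To make the Turkish example concrete, here is a minimal sketch of how culture-specific lowercasing changes whether two local parts would land in the same mailbox.  The class and method names (CaseMappingDemo, LowerFor, SameMailbox) are made up for illustration, and the "compare after lowercasing" policy is only an assumption about what a server might do, not a rule from any specification.  It also assumes the runtime has full culture data available (i.e., it is not running in invariant globalization mode).

```csharp
using System;
using System.Globalization;

static class CaseMappingDemo
{
    // Lowercase a string using the casing rules of the named culture.
    public static string LowerFor(string text, string cultureName)
    {
        return text.ToLower(CultureInfo.GetCultureInfo(cultureName));
    }

    // Hypothetical server policy: two local parts name the same mailbox
    // if they are identical after culture-specific lowercasing.
    public static bool SameMailbox(string a, string b, string cultureName)
    {
        return string.Equals(
            LowerFor(a, cultureName),
            LowerFor(b, cultureName),
            StringComparison.Ordinal);
    }
}
```

Under those assumptions, SameMailbox("INFO", "info", "en-US") is true, while SameMailbox("INFO", "info", "tr-TR") is false, because Turkish lowercases I to the dotless ı (U+0131) rather than to i.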

Even non-Unicode email addresses are tricky to validate, and the regular expressions used to attempt it can get quite long.  HTML5 provides a simplified validation in https://www.w3.org/TR/html5/forms.html#e-mail-state-(type=email), so one approach is to use the HTML5 validation as a starting point and extend the local part to Unicode per RFC 6531.  Basically that means allowing characters > U+007F (or, more practically, above the C1 control characters, which end at U+009F).  For the domain part it's simple enough to check for a valid IDN name, though it'd probably be more correct to use something like the modern Windows.Networking.HostName class.  I stuck with the IDN validation to be more portable.
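
As a rough illustration of the domain check, the sketch below uses the .NET IdnMapping class to convert a domain to its ASCII (Punycode) form and treats a thrown ArgumentException as "not a valid IDN".  The wrapper names (DomainDemo, TryToAscii) are hypothetical, and exactly which inputs get rejected can vary a little between platforms and .NET versions, so treat it as a starting point rather than a definitive check.

```csharp
using System;
using System.Globalization;

static class DomainDemo
{
    // Tries to convert a domain name to its ASCII (Punycode) form.
    // Returns false, with an empty result, if IdnMapping rejects the name.
    public static bool TryToAscii(string domain, out string ascii)
    {
        var idn = new IdnMapping { UseStd3AsciiRules = true };
        try
        {
            ascii = idn.GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // GetAscii signals an invalid name by throwing ArgumentException
            ascii = string.Empty;
            return false;
        }
    }
}
```

For example, TryToAscii("bücher.example", out var ascii) succeeds and sets ascii to "xn--bcher-kva.example", while a plain ASCII name such as "example.com" comes back unchanged.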

Note that this validation doesn't guarantee that an address is a real email address.  Server rules could disallow the local part, or it may just be unassigned.  The domain part may not even be a real domain, even though it is well formed.  C# and C++ versions of the sample are in the attached zip file.  In case you don't want to bother with that, the guts of the C# example are here:

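    using System;
    using System.Globalization;           // IdnMapping
    using System.Text.RegularExpressions; // Regex

    // HTML5 local-part characters, plus BMP code points >= U+00A0 (excluding surrogates),
    // plus well-formed surrogate pairs (high surrogate followed by low surrogate).
    static readonly Regex eaiRegex = new Regex("^([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~\u00A0-\uD7FF\uE000-\uFFFF-]|([\uD800-\uDBFF][\uDC00-\uDFFF]))+$");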

    static bool IsValidEmailAddress(string address)
    {
        // Chop into local and domain parts
        string[] parts = address.Split('@');

        // Needs to have 2 parts (local & domain)
        if (parts.Length != 2)
        {
            return false;
        }

        // Check local part
        // Email address validation is "hard", so we're following HTML5's logic and allowing EAI code points.
        // https://www.w3.org/TR/html5/forms.html#e-mail-state-(type=email)
        // Their expression is:
        //   /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/
        // But that doesn't allow for IDN in the domain, nor Unicode in the local parts, so we added the Unicode range >= U+00A0
        // including validation of surrogate pairs
        if (!eaiRegex.IsMatch(parts[0]))
        {
            return false;
        }

        // Use IDN APIs to check domain part
        // A better choice might be to use Windows.Networking.HostName as that is a bit stricter in its validation.
        IdnMapping idn = new IdnMapping();
        idn.UseStd3AsciiRules = true;
        try
        {
            idn.GetAscii(parts[1]);
        }
        catch (ArgumentException)
        {
            // Not valid IDN
            return false;
        }

        // Passed the test
        return true;
    }
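
To get a feel for what the local-part expression accepts, here is a small self-contained sketch that wraps the same pattern in a helper.  The names LocalPartDemo and IsValidLocalPart are hypothetical, and the low-surrogate range is written as \uDC00-\uDFFF so that any well-formed surrogate pair is accepted.

```csharp
using System.Text.RegularExpressions;

static class LocalPartDemo
{
    // HTML5 local-part characters, BMP code points >= U+00A0 (no surrogates),
    // or a high surrogate immediately followed by a low surrogate.
    static readonly Regex LocalPart = new Regex(
        "^([a-zA-Z0-9.!#$%&'*+/=?^_`{|}~\u00A0-\uD7FF\uE000-\uFFFF-]|([\uD800-\uDBFF][\uDC00-\uDFFF]))+$");

    public static bool IsValidLocalPart(string localPart)
    {
        return LocalPart.IsMatch(localPart);
    }
}
```

With this helper, "shawn", "用户", and a supplementary-plane character such as "\uD83D\uDE00" are accepted, while an empty string, "sh awn" (contains a space), and a lone surrogate such as "\uD83D" are rejected.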

Hope this helps when you need to check if an email address is reasonably well formed,

Shawn

EAI Examples.zip