IDN & Homographs

I’ve divided this into a few parts:


  • About IDN

  • IDN & Security

  • Homograph Thoughts

  • Conclusion

About IDN:


My interest in IDN is that I’m the SDE for the System.Globalization.IdnMapping class in Whidbey.  I also think it’s pretty nifty for the users in countries that use more than the basic Latin letters.


For those of you who don’t know, IDN/IDNA is trying to solve the problem of international (non-ASCII) characters in domain names. IDN is an “Internationalized Domain Name”. RFC 3490 - Internationalizing Domain Names in Applications has the details.  IDN only addresses domain names; it doesn’t attack the email address user name issue or other internationalization issues related to URLs/URIs/IRIs.


Before IDN, domain names were basically restricted to the Latin character set, A-Z, 0-9 and sometimes -. This is useful if your company is, but not so helpful if your company is in Chinese or Cyrillic characters. IDN provides a mechanism for encoding additional Unicode characters using the allowed a-z, 0-9 and - characters. So a name like きくどら.com (Kikuna Driving School) or www.mä (Mäkitorppa mobile store) is represented like (きくどら) or (www.mä


So IDN doesn’t require any changes to the DNS layers of the Internet, but it does require conversion from the Unicode to the ASCII “Punycode” form of a name at some point. A Whidbey .NET application uses the System.Globalization.IdnMapping class to convert between the Unicode and “Punycode” forms.
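
As a rough illustration of that conversion, here is a minimal sketch in Python rather than .NET. It assumes Python's built-in `idna` codec, which implements the RFC 3490 ToASCII/ToUnicode operations, as a stand-in for IdnMapping. The name `bücher.example` and the helper names `to_ascii` and `to_unicode` are made up for this example.

```python
# Sketch: round-trip a domain name between Unicode and its ASCII (Punycode) form.
# Assumption: Python's built-in "idna" codec (IDNA 2003 / RFC 3490) stands in
# for the IdnMapping class; to_ascii/to_unicode are hypothetical helper names.

def to_ascii(name):
    """Convert a Unicode domain name to the ASCII form that DNS sees."""
    return name.encode("idna").decode("ascii")


def to_unicode(name):
    """Convert an ASCII (xn--) domain name back to Unicode for display."""
    return name.encode("ascii").decode("idna")


ascii_name = to_ascii("bücher.example")
print(ascii_name)                                  # xn--bcher-kva.example
print(to_unicode(ascii_name) == "bücher.example")  # True
```

Only the label that contains non-ASCII characters gets the `xn--` prefix; the plain ASCII label passes through unchanged.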


In addition to the Punycode conversion, IDN does some normalization using NFKC and additional mappings such as case folding, which puts all strings in the same case.  Some Unicode characters are considered ambiguous or dangerous and are disallowed in IDN; others are folded into a more common form to prevent some repetition.
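
The effect of those mappings can be sketched with the same Python `idna` codec (again an assumption used for illustration, not the Whidbey API). The helper names `idn_ascii` and `is_allowed` and the sample names are hypothetical.

```python
# Sketch: the preparation step folds case and applies NFKC before Punycode
# encoding, so differently typed forms of a name end up identical, and
# prohibited characters cause the conversion to fail.
# Assumption: Python's built-in "idna" codec; helper names are hypothetical.

def idn_ascii(name):
    return name.encode("idna").decode("ascii")


def is_allowed(name):
    """Return False if the codec rejects the name (e.g. a prohibited character)."""
    try:
        idn_ascii(name)
        return True
    except UnicodeError:
        return False


# Upper- and lower-case non-ASCII labels map to the same ASCII name.
print(idn_ascii("BÜCHER.example") == idn_ascii("bücher.example"))  # True

# Fullwidth Latin letters are folded to ordinary ASCII letters by NFKC.
print(idn_ascii("ｅｘａｍｐｌｅ.example"))  # example.example

# U+200E LEFT-TO-RIGHT MARK is invisible when rendered and is rejected.
print(is_allowed("a\u200eb.example"))  # False
```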


IDN & Security:


IDN disallows some Unicode characters considered dangerous and “folds” others into a more common form in some cases if they are ambiguous.


Even with these restrictions, it was quite obvious that many look-alike characters, or homographs, exist in Unicode.  Examples exist even in ASCII, where a look-alike name can be constructed by using the little el and zero characters.  The DNS system would think this is a different domain name and send a user to a different server, yet, depending on your font, it could be difficult to distinguish from the real domain name.


Unicode has tens of thousands of characters, so when the IDN RFC was created it was clear that the homograph problem becomes even more complicated when Unicode characters are allowed.  For example, Місrоѕо can be written almost entirely in Cyrillic letters (this example has only the r, f & t in Latin. Я just doesn’t look quite the same ;-)).


Even worse, some scripts have characters that are difficult to distinguish.  Many Chinese characters appear very similar in small fonts.  Other scripts have minor diacritics that could be missing or slightly modified such that the user might not notice.  Due to the complexity of the problem, the IDN RFCs leave the homograph problem to be resolved later, perhaps by the registrars or a future RFC.


Since IDN doesn’t directly address the homograph problem, users could be susceptible to spoofing, phishing and other social engineering attacks.  This is exactly what happened with the recent paypal attack.  The IDN name pа was registered with a Cyrillic a for the first A. is the punycode version of this name. 
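
To make the difference concrete, the sketch below compares an all-Latin name with a look-alike whose first a is Cyrillic. It uses Python's built-in `idna` codec and a made-up `.example` domain rather than the name that was actually registered, so treat it as an illustration only.

```python
import unicodedata

# Assumption: both names are invented for this sketch.
real = "paypal.example"        # all Latin letters
spoof = "p\u0430ypal.example"  # second character is U+0430, a Cyrillic letter

# The two strings render almost identically but are not equal.
print(real == spoof)               # False
print(unicodedata.name(real[1]))   # LATIN SMALL LETTER A
print(unicodedata.name(spoof[1]))  # CYRILLIC SMALL LETTER A

# DNS only ever sees the ASCII forms, and those differ as well.
real_ascii = real.encode("idna").decode("ascii")
spoof_ascii = spoof.encode("idna").decode("ascii")
print(real_ascii)                      # paypal.example
print(spoof_ascii.startswith("xn--"))  # True
```

The Punycode form makes the difference visible to software, even though the Unicode form hides it from the person reading the address bar.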


A user following such a link in some browsers would see what looked like in their address bar, but would actually be a different web site.  An email or link from another web site could be used to trick a user into providing their paypal information to an attacker.  This type of attack is similar to the socially engineered emails that have already been used to try to get users to enter personal information by trying to get them to go to or some such URL instead of a real vendor site.


Some people were amused that Mozilla, Firefox and other browsers were susceptible to the pа homograph attack, but Internet Explorer is not (because it doesn’t do IDN conversion).  Equally interesting is the browser reaction of removing IDN support and then choosing to display the Punycode name instead.


Personally I don’t think this is just an IDN weakness.  Rather IDN merely makes an existing problem with trusting links more obvious. or would catch many users anyway. 


Homograph Thoughts:


My thinking is that basically the IDN pа attack where the first A is Cyrillic is a social attack.  For this attack to succeed, a user must first follow an untrusted link.  That link could be a web site (please buy my book at or an email (click here to update your bank information).  Some users are already wary of following unsolicited links from email, but don’t think twice about a web link.  In either case, a look-alike name in the address bar of the browser would be reassuring.


Several solutions to the homograph problem have been suggested.  I don’t have a magic bullet, but I think that the root problem is a user education/social engineering problem; remember that in some cases these attacks can even happen with the non-IDN DNS names.  The following are suggestions I’ve seen in various places and my thoughts about them.  Other people & coworkers disagree, so these are merely my thoughts.


Several suggestions seem American-centric, and I’m disappointed that the developer community doesn’t have a broader global perspective.


  • Disabling IDN – This is perhaps the most obvious suggestion, but seems quite short-sighted.  After all IDN was created to solve a real problem for DNS names that are not ASCII.  If your corporate name is きくどら this “solution” doesn’t help at all.


  • Displaying Punycode – This is also a quite non-global suggestion.  While showing punycode solves the particular pа problem, it’s even worse for the きくどら user. is the same as xn-- to their users (xn-- would be ٮ٨٧٩ٯٲٳ.com, cool it even decodes!)  Even in the www.mä case how is a user supposed to know if it’s or  For non-US users displaying the punycode doesn’t solve the problem; it makes it worse.


There are many other suggestions though.  My concerns with these suggestions are that either a) they are too restrictive, preventing reasonable names, or b) they aren’t sufficient to catch all of the problems, or both.


  • Registrars Should Prohibit Bad Scripts – The idea is that since certain countries would only expect names in their language(s), only those scripts should be allowed.  Some registrars are doing this and it seems reasonable; however, since “everyone” uses .com and the other well-known top-level domains, this solution is pretty incomplete.  I also wonder if this might be a bit too aggressive.  What about a Chinese grocery in Germany?  It seems reasonable that they might want a Chinese character .de domain name.  Fortunately for them (but not for .com), they can currently fall back to .com in this case :-)


  • Disallowing Mixed Scripts – The thought is that since paypal was spelled with mixed Cyrillic and Latin, then disallowing such mixes should prevent such attacks.   Unfortunately Latin is often allowed with other scripts, particularly for businesses with tech names where it is sometimes popular to pick up words in Latin scripts.  Variations of this suggestion include disallowing mixed scripts, except that Latin could appear with most scripts, except Cyrillic.  I question however whether this is accurate or not.  I can easily imagine an import-export company with a multi-script-but-not-Latin name.  I’ve also heard that even Cyrillic can rarely be used with Latin, although I don’t know how the user’s supposed to type a mixed Cyrillic/Latin name.


  • Disallow Non-User Scripts – Check the user locale or keyboard or whatever to see if the URL is a script the user uses.  Many users however have 2nd language skills that aren’t necessarily represented by their current locale or keyboard choice.


  • Prohibiting or Normalizing Homographs (applications) – This is an interesting suggestion, but I have several concerns.  Which form is “correct” if a browser encounters a Cyrillic and a Latin version of the same URL?  I’d imagine that this happens often.  Who is going to find out which ones look alike?  Using which fonts?  I cannot imagine that all homographs could be caught.  There are probably a billion combinations to check, and we (the developer community) would only catch the obvious ones.  The “bad guys” would undoubtedly catch the one(s) we missed.  This also doesn’t help with some subtle character differences: even í, i, ì, and ï look pretty close, and it’s pretty obvious that í and ì can’t be rejected just because they look like i.


  • Certificates Are The Solution – This has the same problem as trying to get the Registrars to do it, although since the fees are higher they might have more resources to address the problem.   Again only 1 CA would have to let a bad name pass through.  Additionally despite the little key icon I suspect many users don’t realize if they’re on a secure connection or not.


  • IDN APIs Should Filter Bad Names – If homograph filters are implemented, the RFC, Registrars or Unicode should drive them.  Obviously different software vendors can’t have different filtering or else those names would break, and I have difficulty suggesting that a registered name cannot be accessed by a software program.


  • Users Should Retype URLs – This seems a bit harsh.  Personally I only care if it’s a “safe” URL if I’m about to enter personal data.  Many URLs are too long and this would also be bad for those people that get credit for referrals to Amazon and the like.


  • Display URLs in Lower Case – Some have suggested that lower case letters are less likely to be confused, however there are still cases like rn vs m. (R + N vs M), and í, i, & ì that look very similar.  Font choice could help, but probably can’t solve the problem, particularly for some users.
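
The “Disallowing Mixed Scripts” idea from the list above can be sketched as follows. This is a crude approximation: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name is used as a stand-in for its script. The helper names `scripts_in_label` and `is_mixed_script` are hypothetical.

```python
import unicodedata


def scripts_in_label(label):
    """Collect a rough script name for every letter in a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label):
    """True if the letters in the label come from more than one script."""
    return len(scripts_in_label(label)) > 1


print(is_mixed_script("paypal"))                 # False
print(is_mixed_script("p\u0430ypal"))            # True
print(sorted(scripts_in_label("p\u0430ypal")))   # ['CYRILLIC', 'LATIN']
```

A naive check like this would also flag a Japanese name that mixes hiragana and kanji, which illustrates the concern that such rules end up too restrictive.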


Some suggestions seem more reasonable to me.  (Of course, I’m just me, so other people probably have other ideas).



  • Whitelisting – This would be some sort of mechanism for trusting sites.  I like this because it would address IP addresses in phishing mail, URLs with keywords and other attacks.  The idea is that if the user tries to enter a form on a new site then they’d be prompted before the action was allowed.  If the site was already whitelisted no prompt would appear.  So a user that used would have some warning if they followed an e-mail link to or  The downside is that determining which forms require extra protection could be hard.  Additionally kiosks like those in hotels or libraries couldn’t persist the whitelist or else someone could intentionally whitelist an unsafe site to trap future users.


  • User Education – At some level users need to be aware of what behaviors are safe or unsafe.  Unfortunately there are a LOT of users, so even a fraction of a percent that remain unaware of these issues would still be at risk.  This could also address problems such as users who use the same name & password for every site they register.  One bogus contest entry and they could be at risk if the same credentials worked on paypal.
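
The whitelisting idea above might look roughly like this. It is a minimal sketch under stated assumptions: the `TrustList` class, its method names, and the `.example` hosts are all hypothetical, and Python's built-in `idna` codec is used to reduce each host name to its ASCII form before comparing.

```python
def normalize_host(hostname):
    """Reduce a host name to its lowercase ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii").lower()


class TrustList:
    """Hypothetical per-user list of sites trusted to receive form data."""

    def __init__(self):
        self._trusted = set()

    def trust(self, hostname):
        self._trusted.add(normalize_host(hostname))

    def needs_prompt(self, hostname):
        """True if the user should be warned before submitting a form."""
        return normalize_host(hostname) not in self._trusted


trusted = TrustList()
trusted.trust("paypal.example")

print(trusted.needs_prompt("PayPal.example"))       # False
print(trusted.needs_prompt("p\u0430ypal.example"))  # True
```

Comparing the ASCII forms means a look-alike with a Cyrillic letter never matches the trusted entry, regardless of how it is rendered.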



Conclusion:


Various technical problems, including IDN, can be combined with well-designed social attacks to allow a user to trust a web site that shouldn’t really be trusted.  Vigilance by the registrars could prevent many homograph attacks, however some will undoubtedly still be possible.  Font choices and browser behavior might limit a few more mistakes, but those can be offset by poor eyesight, dyslexia, monitor differences, color choices (user or application), platform (mobile or PC) and other differences.  User education can also help catch a percentage of the problem.


It seems to me that all of these taken together will reduce the available surface for attacks, but there will still be a window for the attackers to attempt their exploits.  Many of those socially engineered attacks don’t even require IDN Homographs to trap some unwary or uneducated users.


Comments (3)
  1. Unicoder says:

    Oh no! Between you and Michael I’m going to waste even more time on these blogs…

    (Thanks, very informative, hope you become as big a blogosphere legend as Mr Kaplan… 🙂

  2. I posted a suggestion on Michael Kaplan’s blog that I (currently) think this might be a better solution than the ones you have suggested: simply treat similar looking domain names as duplicates. If you go to a registrar and buy, they should not allow anyone to register anything that looks similar to it.

    The obvious problem here is that it is hard for the registrar to automatically tell what looks similar but I think you could start with a fairly simple set of rules and extend them as people complain (so if Paypal find someone is using a domain name that looks remarkably similar to them, they would have a very good case for getting the rules extended). Once you have the general principle that similar looking names are duplicates, the disputes process should be fairly easy (of course there would be problems but nothing is perfect).

    I can’t think of any case where two similar-looking domains should be allowed and it would be unfortunate if companies have to start trying to work out similar-looking domain names and registering them.

  3. Shawn Steele says:

    Unicode is working on a standard to try to help registrars figure out which names/characters are look-alikes, but it’s a hard problem. For some scripts it’s easier than others. Some scripts will probably still be susceptible to problems. Chinese for example has numerous characters, and in small fonts many may look alike. Also you don’t need very many Chinese characters to make a name, so it might be hard to say that one short string looks too much like another short string if they both have legitimate meanings.

    So it seems like a good thing for the registrars to try to prohibit homographs, but I suspect there’ll still be many holes…

