Parts of LCIDs are a Bad Idea

I just posted a couple of things about how We're all Naïve and Catching Globalization Biases, where I mentioned that platform and industry thinking has evolved over the years.  And then today I ran into one of those "good ideas" that really wasn't.

I've blogged in the past about using Locale Names instead of LCIDs, and any modern application should certainly do that, but the old (don't use them!) macros for handling LCIDs are still around.  The quirks of LCID construction and their history are good examples of how "we're all naïve" and how Good Ideas (at the time) seem obviously flawed after a few decades have passed.

(I'll pause here to note that most of this is my conjecture, and that many of the people involved are very smart people.  Some of the reasoning here may have differed somewhat from my suppositions.  I'm merely trying to show how much the software industry has learned in the last half century).

A long time ago computing was a new idea, and we hadn't learned a lot of lessons.  The industry also had different limitations, like hardware speed and code size.  Since computers like numbers, when someone decided that it was time Windows needed to support multiple "locales", it made sense to try to give them all numbers.  ("locales" is even in quotes because thinking about locales/languages/regions and what makes sense has evolved as well since then.)

It's also kind of obvious to American-English-speaking developers that some languages come in a lot of flavors.  Maybe if they'd been Japanese they might have missed that and come up with a different plan, but the difference between American and British English is pretty obvious to English speakers.  We may even have an unconscious bias that makes this difference seem more important than it might to speakers of other languages, like Japanese, where that dichotomy doesn't exist.

Anyway, for whatever reason, since there are clearly variations of languages, this number they were inventing, the LCID, was divided into two parts:  the Primary Language ID (LangID) and the SubLangID.  There are a lot of languages, but only a few sublanguages seemed obvious at the time (English has US and British, and Canadian is sort of obvious from where I sit.  And there's Australian and the Caribbean, and a few others where it's pretty common.)  There may have been several big obvious variants back then, but there are clearly more "languages", so 10 bits were allotted for the languages (surely more than enough!) and the other 6 for the sublang.  A few of those were reserved for special uses.  Macros were created to build LCIDs from the two parts, and to extract them as well, and all seemed good.

One bit that was missed is that sometimes different languages sort things in different ways.  Another unconscious bias at work.  Perhaps if they had been native Spanish speakers this would have been more obvious, but Spanish didn't work out perfectly, and the sorting of Spanish in Spain needed to be adjusted.  This was unfortunate, but the workaround that seemed OK at the time was to add another sublang for Spanish (Spain) that sorted in the "internationally" appropriate order.  This wasn't perfect (now there were two LCIDs for one thing), but at least it was only one country.

Or was it?  Oops, other countries also have multiple sorts.  The CJK languages have interesting complexities that have led to a few different ways of sorting them (pronunciation, stroke count, etc.), and German has a "phonebook" sort for people's names as well as the dictionary sort.  Fortunately only 16 of the 32 bits were being used, and this is all about 32-bit Windows, right?  So we can just add a little bit to the helper macros and create a sort ID.  That's sort of like a sublang ID, but only for the sorts.

With all these macros, we needed a way to talk about them, so a bunch of #defines were created so that you could tell the LANGIDs apart:  which one was English, which was French, which Japanese.  An if or case statement with LANG_ENGLISH is way easier to understand than "case 9:".  Of course the SUBLANGs are a bit klutzier since they depend on the LANGID, but SUBLANG_ENGLISH_US is still more readable than "1", especially since 1 isn't the whole story.

Over time a bunch of other limitations of this design became apparent.  The industry created what are now the BCP-47 language tags, and now LCIDs aren't recommended.  We've even evolved our thinking to focus more on languages and regions than "locales" as a combined concept.  (After all, you can speak British English within the United States; it's not an either-or thing.)  Many of these limitations were difficult to foresee, and many that could have been predicted weren't necessarily considered in depth, whether because of unconscious bias or other considerations.

One of the most obvious issues is that LCIDs lent themselves to a hexadecimal representation:  0x0409 for en-US.  The 0x09 part is "English" and the 0x04 part holds the sublang (really 1, for US, since this is English).  We have all the #defines to help out, and we have the official macros, but it sort of looks like two bytes of a word.  So some software didn't use the macros and just did an & 0xff to chop them apart.  Just like that, 2 bits are lost, and instead of over 1000 possible languages, we should probably only use about 250 if we want to make sure programs keep working.

Well, around this time we also realized that Ethnologue lists A LOT of languages.  Thousands.  Way more than 250.  Actually, people are starting to ask about new languages, and we can foresee hitting that 250 mark.  Of course most of those languages don't have a lot of "sublang" potential (oops), but there are a lot of languages.  Some of those "one-off" languages are somewhat related, maybe deriving from the same original source.  How about giving some of those the same LANGID, and using the SUBLANGs to differentiate?  That staves off the problem, but now our macros and #defines have become a bit less interesting.

On the other side of things, people started asking for sublangs we hadn't considered.  Did you know that the EU has several official languages, and documents in those languages can be found in many different countries?  Languages like English, Spanish & French are popular in many places well outside their nominal countries.  And all those en-Caribbean things we bundled together?  Well, the Caribbean is actually a bunch of countries (and they speak some other languages too!)

Since we gave up 2 bits of those sublanguage IDs to the langids, even if the langid can't really use them, there are only 32 variants allowed.  And some of those are reserved.  So now we don't have enough sublangids to assign one for every variant of something like English.  The "fix" is to add a second LANGID for English, but that sort of breaks the LANG_ENGLISH define we mentioned above.  Now we'd need a LANG_ENGLISH2 🙁  Fortunately, we added support for Locale Names and started encouraging their use over a decade ago, and enough traction has been made that we can probably avoid adding a LANG_ENGLISH2 (by deprecating the LCID idea completely in favor of names), but it's pretty clear that this is a real limitation.

And then there's that whole thing about the American English speaker stationed at a post in Germany.  LCIDs don't really have a way to describe that person.

I should also point out that some of the quirks that happen in APIs like this happen due to innovation.  Oftentimes things need to iterate to improve, and that can be tricky in the OS when customers haven't been able to bang on your SDK yet.  The standards that would point out the problems haven't necessarily been invented yet (and hopefully incorporate the lessons that the innovator learned).

What started as a sane, well reasoned design has been put up against decades of learning about globalization and evolved thinking in the computer science industry.  To be fair, the design has withstood a lot of scenarios and has been a foundation of globalization in Windows for some time, but clearly we've learned a lot since then.  (Not just Microsoft, but the whole industry.)

In summary, many things influenced the evolution of the LCID design.

  • Limitations of the platform or technologies being used.
  • Innovating in a new area - best practices hadn't been created, and the lessons were still to be learned.
  • Possible unconscious biases of the developers and architects involved.
  • Actual mistakes (if that's what the es-ES_TS thing was).
  • An understanding of the globalization space from an earlier time in Computing Science, without the learning about globalization that the industry has made in the last several decades.
  • Implementations that took shortcuts, like & 0xff instead of the provided macros, limiting future assignments.
  • Changing requirements, like sorting variations, that shift the direction of how LCIDs work.
  • Attempts to work around the limitations of the LCID space that impinge on the original design - such as reusing the LANGIDs for multiple languages or the need to assign multiple LANGIDs for the same language.

The "lesson" (or whatever) of this tale is that many factors influence the design of global applications and that it can take a lot of experience to build a robust globalized system.  This is a really tough space.  Trying to be the first to do something new is bound to run into interesting challenges, and it helps to leverage the experiences of those that came before you.  If some behavior in any platform's Globalization support doesn't seem to make sense, dig deeper before you roll your own or try to special-case for a designer's whim; there's probably a reason why it is done that way.

Hope this was interesting to someone,

