Cantonese and Mandarin language tagging.


The IETF “Language Tag Registry Update” (LTRU) working group has noted that a lot of data is tagged as “zh-Hant” regardless of whether it is pronounced as Cantonese or Mandarin.  For video and audio, however, this doesn’t allow a fine enough distinction, so the LTRU is working on revising RFC 4646/4647 and the registry to allow new tags that distinguish Cantonese and Mandarin from the “macrolanguage” of Chinese.

So in the future we should expect to see “cmn” and “yue” tags instead of “zh”.  The LTRU is still somewhat in flux about the details, but it is clear that newly tagged data will use “cmn” and “yue”.  This is going to cause “an interesting time”, since lots of legacy data, resources and systems will continue to use the “zh” tags.

User configurations may need to change, for example by listing both “cmn” and “zh” in a web browser’s language configuration.  Applications and systems may also need to change so that they provide “cmn” resources when “zh” was asked for, or vice versa.  Content providers may also need to retag existing data to distinguish between Cantonese and Mandarin.
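To make the “cmn if zh was asked for, or vice versa” idea concrete, here is a minimal Python sketch of a resource lookup that falls back between the macrolanguage tag and the more specific tags.  Everything in it is an assumption for illustration: the mapping table is hand-written rather than taken from the registry, the function names candidates and pick_resource are hypothetical, and trying “cmn” before “yue” when “zh” is requested is just one possible policy, not something the LTRU has specified.

```python
# Hand-written, illustrative mapping between the macrolanguage subtag "zh"
# and the more specific subtags discussed above. This is an assumption,
# not an extract from the IANA language subtag registry.
MACRO_TO_SPECIFIC = {"zh": ["cmn", "yue"]}
SPECIFIC_TO_MACRO = {
    specific: macro
    for macro, specifics in MACRO_TO_SPECIFIC.items()
    for specific in specifics
}


def candidates(tag):
    """Return lowercase tags to try for a requested tag, most preferred first."""
    primary, sep, rest = tag.lower().partition("-")
    suffix = sep + rest  # e.g. "-hant", or "" when there is no subtag

    # Related primary subtags: the macrolanguage of a specific tag,
    # or the specific tags covered by a macrolanguage.
    related = []
    if primary in SPECIFIC_TO_MACRO:
        related.append(SPECIFIC_TO_MACRO[primary])
    related.extend(MACRO_TO_SPECIFIC.get(primary, []))

    ordered = [primary + suffix] + [p + suffix for p in related]
    if suffix:
        # Finally, retry everything without the script/region subtags.
        ordered.append(primary)
        ordered.extend(related)
    return ordered


def pick_resource(requested, available):
    """Return the first available tag that matches the request, or None."""
    # Language tags are compared case-insensitively.
    by_lower = {tag.lower(): tag for tag in available}
    for candidate in candidates(requested):
        if candidate in by_lower:
            return by_lower[candidate]
    return None
```

With this sketch, a request for “cmn” against content tagged only “zh” resolves to “zh”, and a request for “zh-Hant” against content tagged “yue-Hant” resolves to “yue-Hant”.  A request for “yue” does not fall back to “cmn”, since substituting one spoken language for the other is exactly the confusion the new tags are meant to avoid.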

With these types of changes, the adoption rate usually varies widely.  Expect some applications and content to shift rapidly to the newly recommended names once the new standard is published.  Other data and systems will probably remain unchanged for a very long time, leading to very interesting scenarios when those environments communicate with each other.

Comments (4)

  1. Abel Cheung says:

    The origin of this confusion is that, Hong Kong and Macau people are actually using traditional Chinese version of Windows, which uses Taiwan locale as default. Most people don’t change the configuration at all (since changing configuration will cause most necessary IME to deactivate, thus not really usable). And not to mention, older IE has zh-cn and zh-tw (I haven’t checked IE7), but no zh-hk or whatever language tag. So most people don’t have any way to distinguish between zh-Hant and zh-yue.

    But anyway, don’t underestimate tolerance of Chinese people 😉  They have completely learned and familiarized such mixup situation, so any fix won’t be instantly effective. Perhaps ten more years or so. Not to mention, I have personally seen some small business still running Windows 98. Industries not using IT intensively are unlikely to change OS too much.

  2. orcmid says:

    I think it is "Mandarin."  The IETF seems to think so too.

    When I think about how important fine details are in this kind of tagging, and the general problems of accurately handling linguistic texts, my head wants to explode.  

    I figure the odds of getting it right seem to be decreasing as the richness and complexity of the tagging/collation/keyboard/font/… whatever increases.  There’s something about false assumption of precision because the terms are more precise but I am not at all sure about the practices.  

    Will this remain arcane and strictly subject matter for specialists, or is there a chance of it informing ordinary information technology and software?

  3. Abel Cheung says:

    @orcmid:

    I suppose it is understood to be a typo mistake 🙂

    Making users aware of this issue is close to impossible. In particular, they get used to current situation, and any change means they need to learn something new. Many non-technical computer users DO resist changes, no matter it becomes more "correct" or not.

  4. Shawn Steele says:

    I think you’re right, and that some users will resist changing.  I wasn’t trying to specify a "goodness" level about the issue :), but hopefully, if application developers understand the issues, they can ease the transition between the names, because some people are certainly going to be using the new tags while others continue to use the existing tags.