RichEdit Language Tag Handling

For years, many applications have used the locale ID (LCID) to identify the language and locale for text and other data. For example, since 1997 (RichEdit 2.0), RichEdit’s character formatting has included CHARFORMAT2::lcid. The LCID can, in fact, describe the vast majority of language/locale combinations in use as far as text is concerned. However, the more general BCP-47 language tag is based on international standards, can describe virtually any language/locale combination, and can include additional subtags such as script and private-use subtags. Accordingly, the software industry has been migrating from LCIDs to BCP-47 language tags. Since myriad existing documents use LCIDs, text engines still need to support LCIDs as well. The question arises as to how to support both properties gracefully and in a way that doesn’t require tearing existing programs apart. This post describes how the current RichEdit does this.

The RichEdit that ships with Windows 10 and Office 2016 accomplishes the task in a simple, elegant way by using a 32-bit HLANG handle that happily masquerades as an LCID through existing LCID-dependent APIs. So to support BCP-47 language tags you don’t need to change your interfaces! Think of the HLANG as a 32-bit BCP-47 language tag handle. If you examine it, values with a zero sign bit are actually valid LCIDs, that is, they correspond to BCP-47 tags that can be represented faithfully by an LCID. If no such LCID exists, the sign bit is set and the HLANG is a -1-based index into a process-wide table of currently registered BCP-47 language tag strings.
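To make the encoding concrete, here is a minimal sketch of how a caller might classify an HLANG value. The `HLANG` alias and the helper names `IsLcid` and `TagTableIndex` are hypothetical, and the sketch assumes that "-1-based" means -1 selects the first table entry, -2 the second, and so on; RichEdit's actual internals may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the real handle type, which this post does not define.
using HLANG = std::uint32_t;

// True when the sign bit is clear, i.e. the handle is itself a valid LCID.
constexpr bool IsLcid(HLANG h)
{
    return (h & 0x80000000u) == 0;
}

// For handles with the sign bit set, recover a zero-based table index.
// Assumed mapping: -1 -> 0, -2 -> 1, ... (bitwise NOT in two's complement).
constexpr std::size_t TagTableIndex(HLANG h)
{
    return static_cast<std::size_t>(static_cast<HLANG>(~h));
}
```

With this reading, 0x0409 is treated as an ordinary LCID, while 0xFFFFFFFF refers to the first registered tag string.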

Internally, RichEdit’s character formatting uses the HLANG in place of the LCID. Clients retrieve an HLANG for a BCP-47 tag by calling RegisterLanguageTag(WCHAR *wszBCP47), a function exported by the RichEdit DLL. This is similar to how clipboard formats are handled on the Windows desktop, where you call RegisterClipboardFormat(char *szFormat) and get back a clipboard format handle (a UINT) for szFormat. If that string is already registered, you get back the handle delivered to the first caller. In principle, RegisterLanguageTag() could also be implemented in the OS and allow programs independent of RichEdit to handle BCP-47 language tags without changing their interfaces. But for the moment anyhow, it’s implemented in RichEdit. RichEdit exports two additional functions, one for retrieving the BCP-47 language tag from an HLANG and one for retrieving the LCID from an HLANG:

    BSTR GetLanguageTag (HLANG hlang);

    LCID GetLcidFromLang (HLANG hlang);

The RichEdit client can use HLANGs in place of LCIDs in any RichEdit API that takes LCIDs. These APIs include EM_GETCHARFORMAT, EM_GETRANGEFORMAT, EM_SETCHARFORMAT, EM_SETRANGEFORMAT, ITextFont::SetLanguageID(), and ITextFont::GetLanguageID(). See also the API Generalizations section below.

The function-pointer typedefs for the exported functions RegisterLanguageTag, GetLanguageTag, and GetLcidFromLang are named PFNREGISTERLANGUAGETAG, PFNGETLANGUAGETAG, and PFNGETLCIDFROMLANG, respectively, and are defined by

    typedef HLANG (WINAPI* PFNREGISTERLANGUAGETAG)(const WCHAR *);

    typedef BSTR  (WINAPI* PFNGETLANGUAGETAG)(HLANG);

    typedef LCID  (WINAPI* PFNGETLCIDFROMLANG)(HLANG);

These function pointers can be obtained by calling GetProcAddress(hRichEdit, szFunctionName), where hRichEdit is a handle to the RichEdit module and szFunctionName is one of these function names.

Sample Code

    // Test a standard BCP-47 language tag
    HLANG hLang = RegisterLanguageTag(L"en-US");

    LCID lcid = GetLcidFromLang(hLang);
    TestAssert::AreEqual(lcid, 0x0409, L"Verify the LCID");

    BSTR bstr = GetLanguageTag(hLang);
    TestAssert::AreEqual(L"en-US", (WCHAR *)bstr, L"Verify the language tag");
    SysFreeString(bstr);

    // Try a possible math BCP-47 tag
    hLang = RegisterLanguageTag(wszMathTag); // L"und-Zmth"
    lcid = GetLcidFromLang(hLang);
    TestAssert::AreEqual(lcid, MATH_LCID, L"Verify the LCID");

    bstr = GetLanguageTag(hLang);
    TestAssert::AreEqual(wszMathTag, (WCHAR *)bstr, L"Verify the language tag");
    SysFreeString(bstr);

    // Try a tag with no faithful LCID
    hLang = RegisterLanguageTag(L"sl-IT-nedis");
    bstr = GetLanguageTag(hLang);
    TestAssert::AreEqual(L"sl-IT-nedis", (WCHAR *)bstr, L"Verify language tag");
    SysFreeString(bstr);

Note that all current uses of the math LCID (0x1007F), such as math autocorrect, can be handled faithfully by the math LCID. But for clients that need a BCP-47 string for math zones, the BCP-47 math tag “und-Zmth” is returned by GetLanguageTag(). This differs from what the Win32 LCIDToLocaleName() returns, namely “x-IV_mathan”. It turns out that “x-IV_mathan” isn’t a valid BCP-47 tag; it’s a private-use tag intended for sorting math alphanumerics. Internally RichEdit does use the Win32 language-tag conversion functions, but overrules them where appropriate.

RegisterLanguageTag() tries to find a faithful LCID first, which works in most cases. If no faithful LCID is found, RegisterLanguageTag() currently linearly searches the registered language tag string table for the string. If the string isn’t found, it is added to the table. If telemetry reveals that the string table often contains more than a few strings, a parallel table could be maintained that’s sorted for binary searches.
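The lookup order described above can be sketched roughly as follows. This is an illustration only, not RichEdit's code: `LanguageTagTable`, `FindFaithfulLcid`, and the `HLANG` alias are hypothetical names, the faithful-LCID lookup is reduced to two hard-coded tags taken from this post, and the comparison is exact rather than case-insensitive.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using HLANG = std::uint32_t;   // hypothetical stand-in for the real handle type

// Hypothetical stand-in for the "faithful LCID" lookup; returns 0 if none exists.
inline std::uint32_t FindFaithfulLcid(const std::wstring& tag)
{
    if (tag == L"en-US")    return 0x0409;
    if (tag == L"und-Zmth") return 0x1007F;   // math LCID mentioned in this post
    return 0;
}

class LanguageTagTable
{
public:
    HLANG Register(const std::wstring& tag)
    {
        // 1. Prefer a faithful LCID: the handle is then the LCID itself.
        if (std::uint32_t lcid = FindFaithfulLcid(tag))
            return lcid;

        // 2. Otherwise search the registered strings linearly.
        for (std::size_t i = 0; i < tags_.size(); ++i)
            if (tags_[i] == tag)
                return IndexToHandle(i);

        // 3. Not found: append it and hand back the new entry's handle.
        tags_.push_back(tag);
        return IndexToHandle(tags_.size() - 1);
    }

    // Returns the registered string for a table-backed handle, or an empty string.
    std::wstring TagFromHandle(HLANG h) const
    {
        const std::size_t i = static_cast<HLANG>(~h);
        return ((h & 0x80000000u) && i < tags_.size()) ? tags_[i] : std::wstring();
    }

private:
    // Assumed -1-based encoding: index 0 -> 0xFFFFFFFF (-1), index 1 -> 0xFFFFFFFE (-2).
    static HLANG IndexToHandle(std::size_t i) { return ~static_cast<HLANG>(i); }

    std::vector<std::wstring> tags_;
};
```

Registering the same unfaithful tag twice returns the same handle, which mirrors the RegisterClipboardFormat-style behavior described earlier. A real process-wide table would also need synchronization, which this sketch omits.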

File Format Generalizations

For the RTF format, if a text run’s LCID is faithful, the \langN control word is written as usual. If not, {\bcp47{\*\langtag ...}\langN} is written, where the ellipsis is replaced by the BCP-47 language tag. If the RTF reader understands \bcp47, then it uses the \langtag field to get the BCP-47 language tag; else it skips the \*\langtag group and reads the best-fit \langN. Note that the docx and pptx file formats already use BCP-47 language tags instead of LCIDs. But up to now, only tags corresponding to faithful LCIDs have been supported, since LCIDs have been used internally.
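As a rough illustration of the writer side, the following sketch emits the language control for one text run. The function name `RtfLanguageControl` and its parameters are hypothetical; the caller is assumed to already know whether the LCID is faithful and what the best-fit LCID is.

```cpp
#include <string>

// Emits the RTF language control for a text run.
// The numeric argument of \langN is written in decimal, as usual in RTF.
inline std::string RtfLanguageControl(unsigned bestFitLcid,
                                      const std::string& tag,
                                      bool faithful)
{
    const std::string lang = "\\lang" + std::to_string(bestFitLcid);
    if (faithful)
        return lang;                                   // e.g. \lang1033

    // Unfaithful LCID: wrap the tag so older readers can skip the \* group
    // and still pick up the best-fit \langN.
    return "{\\bcp47{\\*\\langtag " + tag + "}" + lang + "}";
}
```

For example, a faithful en-US run yields `\lang1033`, while a run tagged sl-IT-nedis with an assumed best-fit LCID of 1060 yields `{\bcp47{\*\langtag sl-IT-nedis}\lang1060}`.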

API Generalizations

In general, all RichEdit messages and interfaces that use LCIDs can use HLANGs in place of the LCIDs. This works in-process. If we want inter-process API calls, we’d need to make the exported functions like RegisterLanguageTag OS functions, rather than exported functions that work on a per-process basis. But copying text in binary or RTF file formats works between processes since the HLANGs are exported as BCP-47 tags for BCP-47-aware readers.

Here are some of the messages and interfaces that can be used:

EM_SETCHARFORMAT, EM_GETCHARFORMAT, EM_SETRANGEFORMAT, EM_GETRANGEFORMAT. These messages use CHARFORMAT2, which has an LCID member.

GetStringTypeEx(). Note that CW32System::GetStringTypeEx() can wrap the changes needed to call a new system function given an HLANG instead of an LCID.

ITextFont::GetLanguageID(), ITextFont::SetLanguageID(). These methods use LCIDs. Note that ITextRange2::GetText2() and SetText2() have a BCP-47 language-tag option (see tomLanguageTag), so starting with Windows 8, RichEdit has had some BCP-47-aware functionality. The LanguageTag property only worked when the tags could be represented faithfully by an LCID, but this is upgraded with the new design.

AutoCorrectProc and EM_SETAUTOCORRECTPROC. Use LCIDs and can use HLANGs instead.

Keyboards. Currently use HKL, which is based on the LCID. If IsTransientLcid() returns TRUE, then one needs to get the BCP-47 tag by calling ABI::Windows::Globalization::ILanguageStatics::get_CurrentInputMethodLanguageTag().

There are also some undocumented RichEdit interfaces that use LCIDs and these work with HLANGs as well.

Transient LCIDs

The Windows desktop defines keyboard transient LCIDs to be LCIDs that have a primary language id (low 10 bits) of 0, such as 0x2000, 0x2400, 0x2800, 0x2C00, etc. The value 0x0C00 is used if the locale is the user default locale and 0x1000 is used for all locales that aren’t in the language profile. Currently IsTransientLcid() only returns TRUE for the four values 0x2000, 0x2400, 0x2800, and 0x2C00, and these values are worth supporting on the desktop. But there are only 64 16-bit values with a zero primary language id, so transient LCIDs, while worth supporting, are too limited for HLANG. Note that they don’t currently overlap with the language tag indices used by HLANG when LCIDs are unfaithful.
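The arithmetic behind these statements can be checked with a small sketch. The helper names here are hypothetical, and `IsTransientLcidSketch` simply mirrors the four values listed above rather than reproducing the real IsTransientLcid().

```cpp
#include <cstdint>

// The primary language id occupies the low 10 bits of the 16-bit language id.
constexpr bool HasZeroPrimaryLanguage(std::uint32_t lcid)
{
    return (lcid & 0x3FFu) == 0;
}

// Mirrors the four values for which the post says IsTransientLcid() returns TRUE.
constexpr bool IsTransientLcidSketch(std::uint32_t lcid)
{
    return lcid == 0x2000 || lcid == 0x2400 || lcid == 0x2800 || lcid == 0x2C00;
}

// Counts the 16-bit values whose primary language id is zero
// (6 free sublanguage bits, so 2^6 values).
inline int CountZeroPrimaryLanguageIds()
{
    int count = 0;
    for (std::uint32_t v = 0; v <= 0xFFFFu; ++v)
        if (HasZeroPrimaryLanguage(v))
            ++count;
    return count;
}
```

Since every such value fits in 16 bits, its sign bit as a 32-bit quantity is clear, which is consistent with the observation that transient LCIDs do not collide with the sign-bit-set table indices.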

That’s it. To generalize a program that uses LCIDs, just use HLANGs instead and you can handle arbitrary BCP-47 language tags. Of course, you still need to write code that takes advantage of these language tags. For example, RichEdit has functions to get the charrep from an HLANG and vice versa, which are needed for choosing fonts for scripts. The ISO script subtag of the BCP-47 tag may be used for this and for the most part charreps correspond to ISO scripts. There are exceptions: for example, there is an emoji charrep but no dedicated ISO emoji script tag. But that’s another subject.