CHM Localization and Unicode issues - dbcsFix.exe

In this thread (https://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=2139818&SiteID=1) we discussed the localization issues in CHM. With the September Sandcastle release we have addressed the Unicode issues for CHMs built in East Asian languages. Here’s the problem:

 

1. For localized CHMs the sources need to be in ANSI if the language’s characters don’t all map to Western-1252.

2. If you compile ANSI sources, the HTML Help compiler assumes that the HTML codepage is your current system codepage. If your system is set to EN-US (like mine), the resulting CHM contains incorrect characters unless you change your system settings (which requires a reboot).
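To make the failure concrete, here is a small Python sketch of what a codepage mismatch does to text. It is only an illustration, not anything the HTML Help compiler runs: the sample string is made up, and Big5 versus Western-1252 stands in for any East Asian codepage read on an EN-US system.

```python
# Hypothetical sample text; any Traditional Chinese string would do.
text = "中文"

# The bytes an ANSI (Big5) source file would contain for this text.
ansi_bytes = text.encode("big5")

# The same bytes interpreted with the Western codepage (1252), as would
# happen when the system codepage does not match the file's codepage.
misread = ansi_bytes.decode("cp1252", errors="replace")

# Interpreting the bytes with the matching codepage restores the text.
correct = ansi_bytes.decode("big5")

print(repr(misread), repr(correct))
```

The bytes on disk are identical in both cases; only the codepage used to interpret them differs, which is why the compile step has to run under the right locale.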

 

To resolve this, one must:

1. Note the codepage in the source HTML (a META tag).

2. Re-encode all files in ANSI, using the appropriate code page.

3. Trick the OS into reporting a current codepage that is different from its real one.

4. Compile.

 

Here’s our solution:

1. Have ChmBuilder write the codepage as UTF-8 into the HTML files as it generates them (they actually are UTF-8 at this point).

2. Re-encode the files using dbcsFix.exe. DBCS stands for Double Byte Character Set, and we use this program to convert UTF-8 to ANSI. While doing so, it substitutes the actual codepage (e.g., big5) for what was initially written (UTF-8).

3. Wrap the call to HHC.exe in a call to MS APPLocale or SbAppLocale.exe, passing in the appropriate LCID.
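The re-encoding in step 2 can be pictured with a short Python sketch. This is not the dbcsFix.exe source; the function name `reencode_page`, its parameters, and the choice to turn unrepresentable characters into numeric character references are all assumptions made for illustration.

```python
import re

# Matches the charset value in a META tag such as
# <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">.
CHARSET_RE = re.compile(rb"(charset\s*=\s*)utf-8", re.IGNORECASE)


def reencode_page(utf8_bytes: bytes, charset_label: str, codec: str) -> bytes:
    """Swap the declared charset, then re-encode the page from UTF-8.

    charset_label is the name written into the HTML (e.g. "big5");
    codec is the Python codec name used for the actual conversion.
    Both are hypothetical parameters for this sketch.
    """
    # Replace only the first charset declaration, keeping "charset=" as is.
    relabeled = CHARSET_RE.sub(
        rb"\g<1>" + charset_label.encode("ascii"), utf8_bytes, count=1
    )
    # Assumption: characters the target codepage cannot represent are
    # written as numeric character references instead of raising an error.
    return relabeled.decode("utf-8").encode(codec, errors="xmlcharrefreplace")


page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    "<p>中文</p>"
).encode("utf-8")
converted = reencode_page(page, "big5", "big5")
print(converted)
```

The declared charset and the real byte encoding change together, so a reader that trusts the META tag decodes the page correctly.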

dbcsFix.exe Details:

dbcsFix.exe attempts to work around limitations in the CHM compiler regarding character encodings and representations. Specifically:

1. Replaces some characters with ASCII equivalents, as follows:

Char name                   UTF-8 (hex)     ASCII
Non-breaking space          \xC2\xA0        " " (for all languages except Japanese)
Non-breaking hyphen         \xE2\x80\x91    "-"
En dash                     \xE2\x80\x93    "-"
Left curly single quote     \xE2\x80\x98    "'"
Right curly single quote    \xE2\x80\x99    "'"
Left curly double quote     \xE2\x80\x9C    "\""
Right curly double quote    \xE2\x80\x9D    "\""
Horizontal ellipsis         \xE2\x80\xA6    "..."

After this step, no further work is done when LCID == 1033.

2. Replaces some characters with named entities, as follows:

Char name               UTF-8 (hex)     Named entity
Copyright               \xC2\xA9        &copy;
Registered trademark    \xC2\xAE        &reg;
Em dash                 \xE2\x80\x94    &mdash;
Trademark               \xE2\x84\xA2    &trade;

3. Replaces the default "CHARSET=UTF-8" setting in the HTML generated by ChmBuilder with "CHARSET=" + the proper value for the specified LCID, as determined by the application's .config file.

4. Re-encodes all input HTML from its current encoding (UTF-8, as output by ChmBuilder) to the correct encoding for the specified LCID.
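As a rough illustration of the substitution steps (1 and 2), the following Python sketch applies both replacement passes to raw UTF-8 bytes. It is a sketch under assumptions, not the tool's code: the function name `substitute` is invented, treating LCID 1041 as the "Japanese" exception is a guess at how the tool decides, and the byte sequences used for the copyright sign and the ellipsis are the standard UTF-8 encodings of U+00A9 and U+2026.

```python
ENGLISH_LCID = 1033
JAPANESE_LCID = 1041  # assumption: the LCID treated as "Japanese"

NBSP = b"\xc2\xa0"

# Step 1: characters replaced with ASCII equivalents.
ASCII_REPLACEMENTS = {
    b"\xe2\x80\x91": b"-",    # non-breaking hyphen
    b"\xe2\x80\x93": b"-",    # en dash
    b"\xe2\x80\x98": b"'",    # left curly single quote
    b"\xe2\x80\x99": b"'",    # right curly single quote
    b"\xe2\x80\x9c": b'"',    # left curly double quote
    b"\xe2\x80\x9d": b'"',    # right curly double quote
    b"\xe2\x80\xa6": b"...",  # horizontal ellipsis
}

# Step 2: characters replaced with named entities.
ENTITY_REPLACEMENTS = {
    b"\xc2\xa9": b"&copy;",       # copyright
    b"\xc2\xae": b"&reg;",        # registered trademark
    b"\xe2\x80\x94": b"&mdash;",  # em dash
    b"\xe2\x84\xa2": b"&trade;",  # trademark
}


def substitute(utf8_bytes: bytes, lcid: int) -> bytes:
    """Apply the ASCII pass, then (unless LCID is 1033) the entity pass."""
    out = utf8_bytes
    for src, dst in ASCII_REPLACEMENTS.items():
        out = out.replace(src, dst)
    if lcid != JAPANESE_LCID:
        out = out.replace(NBSP, b" ")
    if lcid == ENGLISH_LCID:
        return out  # no further work for EN-US
    for src, dst in ENTITY_REPLACEMENTS.items():
        out = out.replace(src, dst)
    return out


sample = "a\u00a0b \u2018x\u2019 \u2026 \u00a9 \u2014".encode("utf-8")
print(substitute(sample, 1028))
```

Working on bytes mirrors the tables above. Because these lead bytes only ever start a multi-byte UTF-8 sequence, a byte-level replace cannot match across character boundaries in valid input.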

USAGE:

dbcsFix.exe [-d=Directory] [-l=LCID]

-d is the directory containing CHM input files (e.g., HHP file). For example, 'C:\DocProject\Output\Chm'. Default is the current directory.

-l is the language code ID in decimal. For example, '1033'. Default is '1033' (for EN-US). Run the tool with -? to display this usage information.

After processing the inputs with dbcsFix.exe, the call to the CHM compiler must be made when the system locale is the same as the value set when calling this tool. This can be done either by changing your system settings via the control panel, or by MS APPLocale or by SbAppLocale.exe. In the latter case, the call should be similar to:

SbAppLocale.exe $(LCID) "%PROGRAMFILES%\HTML Help Workshop\hhc.exe" Path\Project.HHp
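If you drive the build from a script, the two calls can be assembled as argument lists. The Python sketch below only builds the command lines; the helper names and the default hhc.exe location are assumptions, and it expects dbcsFix.exe and SbAppLocale.exe to be on the PATH.

```python
# Assumed default install location; adjust to match your machine.
DEFAULT_HHC = r"C:\Program Files\HTML Help Workshop\hhc.exe"


def dbcsfix_command(directory: str, lcid: int) -> list:
    """Command line that re-encodes the CHM inputs in `directory`."""
    return ["dbcsFix.exe", f"-d={directory}", f"-l={lcid}"]


def compile_command(project: str, lcid: int, hhc: str = DEFAULT_HHC) -> list:
    """Command line that runs hhc.exe under the locale given by `lcid`."""
    return ["SbAppLocale.exe", str(lcid), hhc, project]


if __name__ == "__main__":
    # The same LCID must be used for both steps.
    lcid = 1028
    print(dbcsfix_command(r"C:\DocProject\Output\Chm", lcid))
    print(compile_command(r"Path\Project.HHp", lcid))
```

On a Windows machine each list can be passed to `subprocess.run` in that order, so the sources are re-encoded before the compiler sees them.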

Sincere thanks to my colleagues John Carl and Justin Russell for developing dbcsFix.exe. Cheers.

Anand..