MSI Databases and Code Pages

A Windows Installer database is full of strings. Most of the time those strings cause no problems, because they use only the standard, printable characters found in all code pages. These are the ASCII characters, which occupy the first 7 bits (0x00 through 0x7F) and are the same in all code pages except for a few rare ones that exist for legacy support. If a Windows Installer database requires extended characters, that is, characters where the 8th bit is set (0x80 through 0xFF), then a code page is necessary to define how those characters are interpreted and displayed. For example, decimal character 255 is ÿ in ANSI code page 1252 (ANSI – Latin1) but я in ANSI code page 1251 (ANSI – Cyrillic). The database code page is used to display strings on Windows 9x/Me, and to convert strings to Unicode on Windows NT when the W functions are called.
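To make the byte 255 example concrete, here is a minimal Python sketch. It relies only on Python's built-in `cp1252` and `cp1251` codecs, and the variable names are just for illustration; it says nothing about how Windows Installer itself performs the conversion.

```python
# Byte 0xFF (decimal 255) has the 8th bit set, so its meaning depends on the code page.
raw = bytes([0xFF])

latin1_char = raw.decode("cp1252")    # 'ÿ' in ANSI code page 1252 (Latin1)
cyrillic_char = raw.decode("cp1251")  # 'я' in ANSI code page 1251 (Cyrillic)

# Pure ASCII bytes (0x00 through 0x7F) decode the same way under both code pages.
ascii_bytes = b"ProductName"
same_everywhere = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp1251")
```

The same stored byte yields two different characters, while the ASCII string is unaffected by the choice of code page.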

It is recommended that you use only ASCII characters when you can, because you can then author a database with a neutral code page (0). Such a database can be used with any language. If you must include extended characters, set the code page for the database before importing any strings, or you risk corrupting the extended characters. This is common for localized product installation databases, since many languages require extended characters. Once you set the code page for a database, all imported text files must specify the same code page or the import will fail. A file to be imported, commonly referred to as an IDT archive file, would look like the following example:

Property	Value
s72	l0
1252	Property	Property
ProductLanguage	1033
ProductName	Microsoft Visual Studio 2005 Team Suite - ENU

The first row contains the column names and the second row contains their respective types. The third row contains the optional code page, followed by the required table name and an optional list of tab-delimited primary key column names. The example above is part of the Property table for Visual Studio 2005. I inserted 1252 as the code page for this example; the English SKU uses only ASCII characters, so it does not actually need one.
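The three header rows described above can be read with a few lines of Python. This is only a sketch: `IdtHeader` and `parse_idt_header` are hypothetical names, and it assumes that an all-digit first field on the third row is the code page (a table name is assumed never to consist only of digits).

```python
from typing import List, NamedTuple, Optional


class IdtHeader(NamedTuple):
    columns: List[str]
    types: List[str]
    code_page: Optional[int]  # None when the third row omits the code page
    table: str
    primary_keys: List[str]


def parse_idt_header(text: str) -> IdtHeader:
    """Parse the first three tab-delimited rows of an IDT archive file."""
    rows = text.splitlines()
    if len(rows) < 3:
        raise ValueError("an IDT file needs at least three header rows")
    columns = rows[0].split("\t")
    types = rows[1].split("\t")
    third = rows[2].split("\t")
    # Assumption: a leading all-digit field is the optional code page.
    code_page = None
    if third[0].isdigit():
        code_page = int(third[0])
        third = third[1:]
    if not third or not third[0]:
        raise ValueError("the third row must name the table")
    return IdtHeader(columns, types, code_page, third[0], third[1:])


sample = (
    "Property\tValue\n"
    "s72\tl0\n"
    "1252\tProperty\tProperty\n"
    "ProductLanguage\t1033\n"
)
header = parse_idt_header(sample)
```

For the sample, `header.code_page` is 1252, `header.table` is `Property`, and `header.primary_keys` is `["Property"]`. If the third row starts directly with the table name, `code_page` comes back as `None`.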

You can easily display or change the code page for the database using WiLangId.vbs from the Windows Installer SDK, part of the Platform SDK. The same script also displays or changes the supported package languages and the product language used for strings not authored into the MSI database (such as Windows Installer error messages not in the Error table).
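If you would rather set the code page by importing a file yourself, the Windows Installer documentation describes a special archive file for this: two blank lines, then a line with the numeric code page, a tab, and the exact string `_ForceCodepage`. The Python sketch below only builds that text. The function name `force_codepage_idt` is hypothetical, the CRLF line endings are my assumption, and actually importing the file into a database is left to whatever tool you use.

```python
def force_codepage_idt(code_page: int) -> str:
    """Build the text of an archive file that sets a database's code page.

    Layout: two blank lines, then "<code page><TAB>_ForceCodepage".
    CRLF line endings are an assumption made for this sketch.
    """
    if code_page < 0:
        raise ValueError("code page must be a non-negative integer")
    return "\r\n\r\n{}\t_ForceCodepage\r\n".format(code_page)


# Example: text that would request ANSI code page 1252 (Latin1).
latin1_file_text = force_codepage_idt(1252)
```

Passing 0 produces the text for the neutral code page.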

Unofficially, MSI databases do support UTF-7 and UTF-8 by specifying code pages 65000 and 65001, respectively. Encoded strings will store correctly and will be converted correctly when the W functions are called, but they may not display properly because the correct font for wide characters is not chosen.

With this in mind, don’t be surprised if you use Orca to open a database whose code page differs from your current system code page and find that some characters are not displayed correctly (they will most likely appear as boxes or simply as the wrong character). The strings are being displayed or converted to Unicode according to the database code page.
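The "wrong character" effect is easy to reproduce outside of Orca. The Python sketch below illustrates the general mechanism only, not what Orca does internally: a Cyrillic string is encoded with code page 1251 and then interpreted with code page 1252. The sample string is my own choice.

```python
original = "Привет"  # sample Cyrillic text, chosen for illustration

# Bytes as they would be stored under code page 1251 (Cyrillic).
stored = original.encode("cp1251")

# Interpreting the same bytes with the wrong code page gives the wrong characters.
misread = stored.decode("cp1252")   # 'Ïðèâåò'

# Interpreting them with the matching code page recovers the text.
recovered = stored.decode("cp1251")
```

Nothing is lost in storage here; the bytes are intact, and only the interpretation differs.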

It’s also important to note that the database code page is different from the Summary Information stream code page, which is property ID PID_CODEPAGE (1). This is the code page in which the summary information properties are encoded.