How can I get the default code page for a locale?


A customer had an LCID and wanted to know what the code page is for that locale. For example, given locale 1033 (US-English), it should report that the code page is 1252 (Windows Latin 1). They need this information because the file format uses ANSI strings, and the file format for some reason doesn't provide a code page, but does provide a locale.

You can ask GetLocaleInfo for the LOCALE_IDEFAULTANSICODEPAGE to get the ANSI code page for a locale.

UINT GetAnsiCodePageForLocale(LCID lcid)
{
  UINT acp = 0;
  // GetLocaleInfo measures its buffer in TCHARs, even when
  // LOCALE_RETURN_NUMBER makes it write a UINT into that buffer.
  int sizeInChars = sizeof(acp) / sizeof(TCHAR);
  if (GetLocaleInfo(lcid,
                    LOCALE_IDEFAULTANSICODEPAGE |
                    LOCALE_RETURN_NUMBER,
                    reinterpret_cast<LPTSTR>(&acp),
                    sizeInChars) != sizeInChars) {
    // Oops - something went wrong; acp remains zero
  }
  return acp;
}
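
For example, plugging in the customer's scenario from the introduction:

UINT acp = GetAnsiCodePageForLocale(1033); // US-English
// acp is now 1252 (Windows Latin 1)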

This function uses the LOCALE_RETURN_NUMBER flag to say, "Hey, I know that the GetLocaleInfo function normally returns strings, and that's great, but we both know that this thing I'm asking for is an integer (because the name begins with an I). Officially, you need to take that integer and convert it to a string, and officially I need to take that string and convert it back to an integer. How about let's talk like people and you just give me the integer directly?"

And even though you didn't ask, you can use LOCALE_IDEFAULTCODEPAGE to get the OEM code page for a locale.

Bonus gotcha: There are a number of locales that are Unicode-only. If you ask the GetLocaleInfo function for their ANSI and OEM code pages, the answer is "Um, I don't have one." (You get zero back.)
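
Here is a rough sketch that covers both points in one place; the ReportCodePages name is just for illustration, and error handling is omitted:

#include <windows.h>
#include <stdio.h>

void ReportCodePages(LCID lcid)
{
  UINT acp = 0, oemcp = 0;
  int sizeInChars = sizeof(UINT) / sizeof(TCHAR);

  // ANSI code page for the locale
  GetLocaleInfo(lcid, LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                reinterpret_cast<LPTSTR>(&acp), sizeInChars);

  // OEM code page for the locale
  GetLocaleInfo(lcid, LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
                reinterpret_cast<LPTSTR>(&oemcp), sizeInChars);

  if (acp == 0 || oemcp == 0) {
    printf("Locale %u is Unicode-only (no ANSI/OEM code page)\n",
           static_cast<unsigned>(lcid));
  } else {
    printf("Locale %u: ANSI code page %u, OEM code page %u\n",
           static_cast<unsigned>(lcid), acp, oemcp);
  }
}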

Comments (31)
  1. Roman says:

    If you select a Unicode-only locale, how do the ANSI functions (*A) work?

    1. Darran Rowe says:

      They most likely don’t work consistently.
      There are some functions that are implemented by calling WideCharToMultiByte, converting it to a Unicode string, and this will probably fail because of the lack of the ANSI codepage.
      So in general, if you are using a Unicode only codepage, use the W functions.

  2. Myria says:

    If only Windows allowed you to set CP_UTF8 as the default code page… *sigh*

    1. Darran Rowe says:

      Of course, there is one other way to look at this.
      If only Linux didn’t choose the awful hack to keep using codepages when they had the chance to make Unicode output independent of the codepage. All of this was just to be lazy and allow them to keep using what they knew… *sigh*

      1. Joshua says:

        That’s nothing to do with it and you know it. Setting your locale to UTF-8 is *how* you detach it from a code page.

        GP wants to do this on Windows so he can use UTF-8 with the myriad of non-UTF-8 programs, most of which will work if given half a chance.

        1. Darran Rowe says:

          Well, tbh, this was basically a counter-moan to a rather stupid moan.
          The choice that Windows took a long time ago to do what it did to output Unicode can be viewed as just as stupid as the choices that Linux took to get it to work.
          But no, setting your locale to UTF-8 is not how you detach from a code page. First, a locale has a lot more in it than just the code page for the character set it uses; the formatting for dates/times and other stuff is in there. Secondly, just because you set your locale’s code page to UTF-8 doesn’t mean you are detached from code page output. By definition you are still using a code page.
          The real problem here is the programming languages. C only really has functions for char and wchar_t, and because of how loosely defined wchar_t is, it is hard to use portably. So as a crutch, UTF-8 was used a lot throughout the language as a code page while pretending that it wasn’t a code page.

    2. Dan says:

      Microsoft had the luxury of writing a new operating system (Windows NT) from scratch, which introduced the opportunity for a shiny new API using wide characters. The UTF-8 locale “codepage” approach (as used in the *nix world) has the advantage of backwards-compatibility.

      Using UTF-16 for the OS API wasn’t IMO *inherently* a poor decision, but it interacts poorly with the C++ standard library that basically treats wchar_t as an afterthought. For example, there is no wide-character version of std::exception::what. Nor a wide-character equivalent of fopen (_wfopen is a Microsoft-specific extension) or even main (_wmain, again, is Microsoft-specific).

      In the 7-bit ASCII days, you could write a cross-platform library by sticking to “Standard” C or C++ as much as possible and falling back to low-level OS-specific functions (wrapped in #ifdef) only when doing something that wasn’t standardized (like walking a directory tree, spawning a thread, or making a GUI).

      But on Windows, all Standard C++ functions that take a char* are “defective” in that they can’t represent the entire character repertoire supported by the underlying OS for filenames or console output. So to support Unicode properly, you have to actively *avoid* the standard library and use the Windows-specific _w functions.

      This makes it a royal PITA to write code that is simultaneously Unicode-aware and cross-platform. I’ve had to do this. The approach I’ve taken is to use UTF-8 everywhere as the standard character encoding, and write wrapper functions that do a behind-the-scenes UTF-8 to UTF-16 conversion on Windows, like so:

      FILE* fopen_utf8(const char* path, const char* mode)
      {
      #ifdef _WIN32
          // Convert the UTF-8 path and mode to UTF-16 and use the wide API.
          FILE* pFile = NULL;
          _wfopen_s(&pFile, ToWstring(path).c_str(), ToWstring(mode).c_str());
          return pFile;
      #else
          return fopen(path, mode);
      #endif
      }

      …and repeat as necessary for stat, getenv, etc.
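
      (ToWstring isn’t shown here; a minimal sketch of such a helper, assuming UTF-8 input and using MultiByteToWideChar, could look like this:)

      #include <windows.h>
      #include <cstring>
      #include <string>

      // Hypothetical helper: convert a UTF-8 string to UTF-16 for the wide APIs.
      std::wstring ToWstring(const char* utf8)
      {
          int srcLen = static_cast<int>(std::strlen(utf8));
          int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8, srcLen, NULL, 0);
          std::wstring result(wideLen, L'\0');
          if (wideLen > 0) {
              MultiByteToWideChar(CP_UTF8, 0, utf8, srcLen, &result[0], wideLen);
          }
          return result;
      }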

      If Windows supported CP_UTF8 as the ANSI code page, I could just use good old fopen.

      1. Darran Rowe says:

        I would say you are slightly wrong with this.
        The thing to remember is that to use the new Unicode-related stuff, these applications would have to be rebuilt for 32-bit Windows anyway. The functions in the Windows headers are protected by macros that choose which versions of the functions you want, with ANSI chosen by default.
        So just by running the 16-bit application you would get ANSI behaviour; if you recompiled it and fixed the 16-bit to 32-bit problems, you would still get ANSI behaviour if you didn’t give the compiler the Unicode macro options.
        Also, remember that Microsoft were early adopters of the Unicode standard, and they got bitten by this. You say “Using UTF-16 for the OS API wasn’t IMO *inherently* a poor decision,” but back when they adopted it, there wasn’t a UTF-16; there were only UCS-4 and UCS-2. UTF-8 wasn’t fully specified until 1992-1993, and Windows NT and the Win32 API were developed 2 years prior to that.
        But one thought here is that the problem isn’t the choices of the OSes involved, but the choices of the programming languages. C still isn’t fixing the functions that are lacking, and there is no good standard library for working with Unicode I/O. C++ has tried to fix some of these, but it is still lacking, and the last time I checked, iostreams still couldn’t output char16_t or char32_t based strings.

        1. Yuhong Bao says:

          I have been thinking, for example, of warning developers with NT 3.1 that an MBCS character may exceed two bytes, then actually implementing UTF-8 as the ACP/OEMCP with NT 3.5.

          1. And then everybody will say “Don’t upgrade to NT 3.5. It screws up strings in pretty much all apps.”

          2. Joshua says:

            @Raymond: That’s not what would have happened. What would have happened is “Don’t use UTF-8 code page; it breaks programs.” And a decade from then those programs would have faded into obscurity.

            The obvious trick here is to implement the darn thing but *don’t* make it the default. You don’t want it to be the default anyway because it messes with how you interpret text files. And then when AppLocale came along (it was rather obvious it would come along) it would more or less clean up the problem.

          3. Yuhong Bao says:

            Yea, I agree that it should not be the default. Also NT 3.1 and 3.5 would have been so early that there would not have been many Win32 programs in the first place.

          4. Would you have delayed the release of Windows NT 3.5 to fix all the in-box apps so they could support UTF-8 as the ANSI code page?

          5. Darran Rowe says:

            The big problem is, UTF-8 didn’t really take off until the explosion in the use of the internet around 2000 or so. Until that time, it was anyone’s guess as to what way things would go. What’s more, when did Linux actually start using UTF-8 as an output character set? The earliest articles I can find date it back to 2000 or 2001, so there is also the expectation here that Windows should have somehow seen 7 years into the future and made a decision about an encoding that didn’t even exist when the Win32 API was designed.
            But anyway, instead of complaining like this about past decisions, maybe all of you should put effort into standardising a library and proposing it to either the C or C++ standards committee for adoption? In the end, this is the only way to get proper standards-defined behaviour without relying on implementation-defined behaviour.

          6. Joshua says:

            @Raymond Chen: In fact only two, but this might be with the benefit of hindsight. Notepad would need to be fixed (and we *know* this one is trivial) and the console would need to be fixed (we now know this one as conhost.exe but it was once part of csrss.exe).

            If you managed to guess AppLocale would exist the rest can be derived but it’s far less obvious if you didn’t manage it. I might well have held for conhost because of the fact the bugs are reachable now and were reachable then.

          7. Plus File Manager (because if File Manager required AppLocale, that would suck), and the common file dialogs (because that runs in-process, and the host process might not have AppLocale enabled), and the window manager and the standard controls and the common controls, and GDI too. And then of course people would say “Why bother adding support for UTF-8 as CP_ACP if you can’t even finish the job?”

          8. Joshua says:

            All the A->W functions would just work. On the other hand, for File Manager: if it’s not converted to UTF-8 locale support, it’s broken if it doesn’t call all W functions; if it calls all W functions, it needs no conversions. Same for OpenFileName/SaveFileName. AFAIK this leaves only the edit control (everything else could be an A->W function), and the W edit control didn’t work until Vista.

        2. Yeah, UTF-16 did not exist when NT was created, and in fact Unicode was specified as having a 16-bit code space until July 1996, so the choice to use a 16-bit representation for WCHAR really wasn’t that unreasonable given the information available at the time.

          Note: If, like me, you find yourself extremely confused as to why the Unicode consortium would bother to specify a 32-bit interchange format for a 16-bit code, you’ll be relieved to hear that this did not in fact happen. In fact, UCS-2 and UCS-4 come from ISO 10646, which originally specified a 31-bit code space, for which a 32-bit interchange format is the glaringly obvious approach. ISO 10646 divided this 31-bit code space into 128 groups of 256 planes of 256 rows of 256 cells. Originally, UCS-2 was intended to be used along with ISO 2022 escape characters to represent characters beyond the Basic Multilingual Plane; to make this work, it would have been necessary to ban octets in the C0 and C1 control ranges (0x00-0x1F and 0x80-0x9F, respectively). If you’re not horrified yet, consider that ISO 2022 is stateful, that Unicode 1.0 used many such codepoints (anything from 0100 to 01FF, for starters, plus something like a quarter of the rest), and so this concept for ISO 10646 would have left us with two character encoding standards to rule them all, if adopted. (And still using ISO 2022 for code-switching.)

          Fortunately for all of us, the computer industry talked enough national bodies into voting against that version of ISO 10646, so the ISO 10646 standardizers ended up dropping the whole ISO 2022 thing, including the prohibitions on using C0/C1 bytes as components, and then negotiating a unification with Unicode. (But this was all too late to really affect NT’s API: Unicode 1.1 (the first unified version) didn’t ship until June 1993, and NT 3.1 shipped on July 27, 1993.)

          Unfortunately, it turned out that a 16-bit code space was not enough, so now we have UTF-16 and (I assume) a lot of software that doesn’t support it properly, because for most European languages the BMP is quite enough (aside from the occasional emoji, which will only rarely trigger anything but the simplest bugs) :-(.

          1. Dan says:

            The original 31-bit ISO 10646 proposal is also why UTF-8 was originally designed with support for 4-, 5-, and 6-byte code sequences. (BMP characters only require the 1-, 2-, or 3-byte sequences).

            This came in handy when the Unicode Consortium decided that 65 536 characters weren’t enough for everyone after all. Frameworks using 16-bit characters to represent Unicode strings had to introduce the variable-length UTF-16 “surrogate” mechanism in order to be able to use the new characters. UTF-8-based software merely needed to dust off a subset of the 4-byte sequences (representing U+10000 through U+10FFFF) that had originally been intended for ISO 10646.

            Exercise for the reader: Suppose that Unicode decides to add so many emoji or obscure historical logographic scripts that the existing 17-plane code space becomes too confining. Design an encoding for representing the extra characters in 16-bit string APIs that’s as backwards-compatible as possible with UTF-16.

      2. Myria says:

        One little problem: UTF-8 was published about 6 months before Windows NT 3.1 came out. UTF-8 didn’t exist when the OS was designed and most of it was being written.

    3. The problem is that a lot of apps assume that MaxCharSize for CP_ACP is never more than 2.

      1. osexpert says:

        Won’t MaxCharSize be 1 for UTF8? Isn’t the whole point that UTF8 is indistinguishable from ANSI?

        1. Darran Rowe says:

          The point of UTF-8 is to make a byte-oriented Unicode encoding; it has nothing to do with making it indistinguishable from ANSI.
          Unicode itself has combining characters, so a single character can be made up of two or more code points. UTF-8 is an encoding of Unicode, so it has to be able to support everything that Unicode itself supports, and this means that UTF-8 has to allow variable-length characters.

          1. osexpert says:

            Unicode says it is compatible with ASCII, so I don’t see MaxCharSize = 1 as completely wrong. If code can handle ASCII it can handle UTF8, though it would be wrong if it uses its own logic for strlen etc.; it should be OK if it uses OS functions and the OS knows how to handle UTF8 (this is the missing part). I think it would work fine in most cases, but it would probably have to be opt-in/at your own risk. No guts, no glory :-)

          2. Dan says:

            @osexpert: How do you define the “length” of a string? As the number of user-perceived graphemes (which may differ, for the same sequence of code points, between different languages)? As the number of Unicode code points? Or as the number of bytes the string takes up in memory? The last of these is quite useful in practical situations, like allocating a buffer in which to store a string, and it works just fine with UTF-8.

            The difficulty with ASCII-centric code is functions like toupper/tolower that work on individual char values, which obviously don’t cope with multi-byte characters.

            But, *if* a program or library is compatible with “any locale-specific character encoding that’s a superset of ASCII” (I’m excluding here font rendering engines or similar components whose raison d’être is low-level text manipulation, that care very deeply about which specific encoding a string is in), and does *not* have any hard-coded assumptions that 2-byte, 3-byte, or 4-byte characters can’t exist, it can cope perfectly well with UTF-8.

        2. When the user hits the left or right arrow key, or hits Backspace or Delete, the program needs to decide how far to move / how many bytes to delete. Old programs do this by checking whether a byte is a DBCS lead byte. If so, then it and the next byte are considered one unit. Otherwise, just the byte itself is the unit. This completely falls apart not just for UTF-8 but for Unicode in general, because Unicode grapheme clusters can span multiple code points. If you report MaxCharSize=1, then the cursor is going to end up in the middle of a UTF-8 sequence, and then hitting Backspace will do, um, interesting things.
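
          (A minimal sketch of that classic lead-byte scan, assuming <windows.h>; the NextCharOffset name is just for illustration:)

          // "Move right one character" the DBCS way: a lead byte plus its
          // trail byte count as one unit, anything else is one byte. This
          // bakes in the at-most-two-bytes assumption that UTF-8 (up to
          // four bytes per code point) breaks.
          int NextCharOffset(const char* text, int pos, UINT codePage)
          {
              if (text[pos] == '\0') return pos;  // already at the end
              if (IsDBCSLeadByteEx(codePage, (BYTE)text[pos]))
                  return pos + 2;                 // lead byte + trail byte
              return pos + 1;                     // single-byte character
          }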

          1. French Guy says:

            It’s quite simple to delete a full code point in UTF-8, since it’s easy to tell whether a given byte is leading (non-leading ones are of the form 10xxxxxx). You keep deleting until you hit the leading byte (and delete it). To delete a full character, you do more of the same. When deleting a code point, you check its general category. If it’s a combining mark (general category Mc, Me or Mn), you continue deleting (until you delete a code point that is not a combining mark).

            However, the maximum length in bytes is quite high given the sheer number of combining characters in Unicode (and if you’re not limited to distinct combining characters, there’s no maximum length at all).
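
            (A rough sketch of the code-point half of that backward scan; the combining-mark check would additionally need a Unicode property lookup, which is omitted here:)

            #include <stddef.h>

            // Step back from 'pos' to the start of the previous UTF-8 code
            // point: skip continuation bytes (10xxxxxx), then stop at the
            // lead byte. Deleting one code point before 'pos' means removing
            // the bytes in [PrevCodePointStart(buf, pos), pos).
            size_t PrevCodePointStart(const unsigned char* buf, size_t pos)
            {
                if (pos == 0) return 0;
                do {
                    --pos;
                } while (pos > 0 && (buf[pos] & 0xC0) == 0x80);
                return pos;
            }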

        3. Dan says:

          For UTF-8, MaxCharSize = 4. This actually works if you call GetCPInfo with CP_UTF8 as the first parameter.
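
          (For instance, a quick check along these lines, assuming <windows.h> and <stdio.h>:)

          CPINFO info;
          if (GetCPInfo(CP_UTF8, &info)) {
              printf("MaxCharSize = %u\n", info.MaxCharSize);  // reports 4 for UTF-8
          }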

  3. This probably explains why the Unix world could (eventually, mostly?) get away with switching to UTF-8 for the locale-default encoding: relatively few programs were visual to start with, and I really don’t think very many of those were keen on implementing their own text entry widgets from scratch, so this kind of cursor-motion craziness probably didn’t crop up so much.

    Also, at least on GNU/Linux, it was/is comparatively simple to change encodings: rather than being a purchase-time choice, system-wide setting, or per-user setting as on Windows, the locale is determined by environment variables, and working around a GUI program’s inability to handle UTF-8 could be as simple as setting something like LANG=en_US in its environment (assuming you still had that locale listed in /etc/locale.gen), compared to the LANG=en_US.UTF-8 that the program couldn’t handle.

    I think Unix programs have also had to deal with somewhat more horrific encodings than Windows has ever seen fit to use, so a mere up-to-4-bytes-per-codepoint encoding like UTF-8 is child’s play by comparison. I certainly don’t remember any isdbcslead() function to give anyone the impression that it was safe to assume that no more than two bytes in a row would be needed to represent a character…

    1. Oops, this was meant to be a reply to Raymond’s “When the user hits the left or right arrow key, or hits Backspace or Delete …” comment; I guess I lost the relevant query parameter when I logged into my account :-(.

Comments are closed.
